
Post on 20-Jan-2016


Page 1: Internationalization Using Locales

Achim Ruopp

Page 2: Agenda

Working with multilingual data
Language and locale identifiers
Locale data
Frameworks for locale support
Ideas/discussion: how this could be used in compling

Page 3: Not about character encoding

Read Jeremy’s slides from last quarter
  • http://students.washington.edu/jgk/talks/char-enc/char-encodings.pdf
Use Unicode wherever possible

Page 4: Internationalization: More than Encoding Text

Where are the word breaks?
  คลิกปุ่มเมาส์ขวา
  Your balance is $1234.56... I think.
How do I sort these words in French?
  • cote (dimension)
  • côte (coast)
  • coté (with dimensions)
  • côté (side)
How do I uppercase this word in Turkish?
  • istiyorum - İstiyorum
How do I transcribe this text into Latin characters?
  • 인수문제를 - in'su'mun'je'reul'
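
Two of the questions above can be sketched in a few lines of Python. This is a minimal illustration, not how ICU or any other framework implements it: `french_sort_key` and `turkish_upper` are hypothetical helpers, the accent weights are simply Unicode code points, and only the dotted/dotless i pair is handled for Turkish.

```python
import unicodedata

def french_sort_key(word):
    """Hypothetical key: compare base letters first, then accents from the END of the word."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the combining mark to the base letter it follows.
            if accents:
                accents[-1] += ch
        else:
            base.append(ch.lower())
            accents.append("")
    # Reversing the accent list makes the last accent difference decide.
    return ("".join(base), accents[::-1])

def turkish_upper(word):
    """Hypothetical helper: map dotted i and dotless ı explicitly, then uppercase."""
    return word.replace("i", "İ").replace("ı", "I").upper()

words = ["côté", "coté", "côte", "cote"]
print(sorted(words))                       # raw code point order, not the slide's order
print(sorted(words, key=french_sort_key))  # → ['cote', 'côte', 'coté', 'côté']
print("istiyorum".upper())                 # → ISTIYORUM (dot lost, wrong for Turkish)
print(turkish_upper("istiyorum"))          # → İSTİYORUM
```

In practice a collation or casing API backed by locale data is preferable to hand-written rules like these.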

Page 5: Cultural Conventions

What does this date stand for?
  • 3/8/2006
What is the currency symbol for Hungary?
… linguistic characteristics of languages and cultural conventions – a locale
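
The date question can be made concrete with a short sketch. It assumes only two candidate conventions (month first, as in the US, and day first, as in much of Europe); real locale data distinguishes many more patterns.

```python
from datetime import datetime

raw = "3/8/2006"

# Same string, two conventions, two different dates.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(month_first.isoformat())  # → 2006-03-08
print(day_first.isoformat())    # → 2006-08-03
```

Without knowing the locale the string came from, there is no way to pick between the two results.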


Page 7: Internet Language Tags

Used today: RFC 3066 (successor to RFC 1766)
  • Generative: ISO 639-1/2 language tag[-ISO 3166 country tag], e.g. fr, en-US, ale-CA
  • Registered with IANA, e.g. no-nyn, zh-Hant
  • Exceptions: x-…
Several problems
  • Dependency on ISO standards
  • No generative options for dialects etc.
  • RFC3066bis should solve this
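
The generative pattern on this slide (language code, optional country code) can be sketched as a small parser. This is an assumption-laden illustration: `LanguageTag` and `parse_tag` are hypothetical names, the regular expression covers only the generative form and the `x-` private-use prefix, and it is not the full RFC 3066 grammar. Registered tags such as zh-Hant would need a lookup in the IANA registry, so the sketch returns `None` for them.

```python
import re
from typing import NamedTuple, Optional

class LanguageTag(NamedTuple):
    language: str           # ISO 639 code, or the whole tag for private use
    region: Optional[str]   # ISO 3166 country code, if present
    private: bool           # True for x-... tags

# 2- or 3-letter language code, optionally followed by a 2-letter country code.
_GENERATIVE = re.compile(r"^([A-Za-z]{2,3})(?:-([A-Za-z]{2}))?$")

def parse_tag(tag: str) -> Optional[LanguageTag]:
    if tag.lower().startswith("x-"):
        return LanguageTag(tag, None, True)
    match = _GENERATIVE.match(tag)
    if match is None:
        return None  # not generative; would need a registry lookup
    language, region = match.group(1).lower(), match.group(2)
    return LanguageTag(language, region.upper() if region else None, False)

for tag in ["fr", "en-US", "ale-CA", "zh-Hant", "x-sil-swg"]:
    print(tag, "->", parse_tag(tag))
```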

Page 8: SIL Ethnologue

Cataloging all of the world’s 6,912 known living languages
http://www.ethnologue.com/
Uses ISO/DIS 639-3 3-letter codes
E.g. Swabian dialect: x-sil-swg
Hope for consolidation with RFC 3066 or its successor once 639-3 becomes a full standard
Not so well supported in programming frameworks


Page 10: Types of Locale Data

Date/time formats
Number/currency formats
Collation specification
  • For sorting and comparison
Translated names for languages, regions, scripts, time zones, currencies, …
Script and characters used by a language
Measurement system
Paper sizes
…

Page 11: Common Locale Data Repository

“The purpose of the Common Locale Data Repository project is to provide a general XML format for the exchange of locale information for use in application and system development, and to gather, store, and make available a common set of locale data generated in that format.”

http://www.unicode.org/cldr/

Page 12: Common Locale Data Repository

Collection/vetting process
  • Contributors add/modify data
  • Reviewed by committee
Accessible over the web
  • Locale Data Markup Language (LDML) XML format
  • E.g. http://unicode.org/cldr/data/common/main/fr.xml
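
Because LDML is plain XML, locale data can be read with any XML parser. The sketch below assumes a tiny hand-written fragment in the shape of an LDML file (it is not copied from CLDR) and a hypothetical helper `language_names`; a real file such as fr.xml is far larger and also carries inheritance and draft metadata that this ignores.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an LDML file; NOT actual CLDR data.
SAMPLE_LDML = """\
<ldml>
  <identity>
    <language type="fr"/>
  </identity>
  <localeDisplayNames>
    <languages>
      <language type="de">allemand</language>
      <language type="en">anglais</language>
    </languages>
  </localeDisplayNames>
</ldml>
"""

def language_names(ldml_text):
    """Return {language code: translated language name} from an LDML document."""
    root = ET.fromstring(ldml_text)
    path = "./localeDisplayNames/languages/language"
    return {el.get("type"): el.text for el in root.findall(path)}

print(language_names(SAMPLE_LDML))  # → {'de': 'allemand', 'en': 'anglais'}
```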


Page 14: Frameworks: POSIX Locale

Standard C/C++ library
  • LC_COLLATE: sorting/comparison
  • LC_CTYPE: behavior of character handling
  • LC_MONETARY: monetary formatting
  • LC_NUMERIC: numeric formatting
  • LC_TIME: date/time formatting
Used in Un*x systems for command line functions too
Results can be platform-dependent
Stable, but feature set stuck in the 1980s
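
Python's standard `locale` module wraps these C library categories, so the LC_COLLATE behavior can be tried out directly. A minimal sketch, assuming a hypothetical helper `sort_in_locale` that falls back to the always-available "C" locale when the requested one is not installed; which named locales exist, and how they order strings, depends on the platform.

```python
import locale

def sort_in_locale(words, locale_name):
    """Sort words under LC_COLLATE for locale_name; fall back to "C" if unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            used = locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            used = locale.setlocale(locale.LC_COLLATE, "C")
        # strxfrm turns each string into a key that sorts per LC_COLLATE.
        return used, sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state

words = ["Zebra", "apple", "Mango"]
print(sort_in_locale(words, "C"))            # "C" locale: plain code point order
print(sort_in_locale(words, "en_US.UTF-8"))  # result depends on the platform
```

Note that `setlocale` changes process-wide state, which is one reason the POSIX model is awkward for code that handles several locales at once.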

Page 15: Frameworks: ICU Library

IBM open source project
Developed originally for the Taligent OS project in the late 80s/early 90s
Java and C++ APIs
Extensive locale data and APIs to use it
  • http://www.icu-project.org/cgi-bin/locexp
Also includes localization support
Everybody (Mac OS X, Java, DB2, Mathworks …) is using it
But …

Page 16: Frameworks: Microsoft

Windows NLS API
Microsoft .NET Framework System.Globalization namespace
Similar set of data to ICU
  • Vetted by subsidiaries
APIs accessible from all MS programming languages
Localization support in a different API

Page 17: Microsoft demos

Culture Explorer
Microsoft Transliteration Utility

Page 18: Extensibility

What if I don’t find the locale I need?
What if I need to modify some of the data?
ICU
  • Can create new locales
Microsoft
  • .NET Framework v2.0: custom cultures
  • Windows Vista: custom locales
LDML can be the interchange format


Page 20: Usages for Computational Linguistics

Up to the imagination
  • Transliteration use in MT
  • Named Entity Recognition
  • …
  • suggestions?
Most importantly: Do not reinvent the wheel!
  • Check if the API or data you need is available
If possible, write code in a language/locale-independent fashion

Page 21: References

RFC3066bis
  • http://www.inter-locale.com/ID/why-rfc3066bis.html
Ethnologue
  • http://www.ethnologue.com/
Common Locale Data Repository
  • http://www.unicode.org/cldr/
POSIX Locale
  • http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
ICU
  • http://icu.sourceforge.net/
Microsoft
  • http://www.microsoft.com/globaldev/
UNGEGN Working Group on Romanization Systems
  • http://www.eki.ee/wgrs/