TRANSCRIPT
A Field Linguist’s Guide to Unicode
Deborah Anderson
Script Encoding Initiative
(Universal Scripts Project)
Dept. of Linguistics, UC-Berkeley
LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases
January 4, 2007
Working with Text Representation
“Use Unicode” (ISO/IEC 10646)
Practical issues to consider:
* which Unicode characters?
* what about fonts?
* how about keyboards?
* will the language be supported in off-the-shelf software?
The goal today is to discuss the whole process of enabling a language to be used on a computer:
* identifying letters/symbols in Unicode
* fonts
* keyboards
* how to get support for the characters and scripts in software
Step 1. Identify the characters used in a language
* List all letters, symbols, digits, and marks of punctuation used in a language
* Assign Unicode codepoints (see http://www.tlg.uci.edu/quickbeta.pdf)
* Post a plain text version on a publicly accessible website
* Circulate this list for comment
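The listing step can be sketched in Python with the standard `unicodedata` module. This is a minimal illustration, not a prescribed workflow: the sample text is a hypothetical placeholder rather than real language data, and the function name `character_inventory` is an assumption introduced here.

```python
import unicodedata

def character_inventory(text):
    """Return one line per distinct character: code point, character, Unicode name."""
    lines = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue  # spaces and line breaks are left out of the inventory
        name = unicodedata.name(ch, "<no name in this Unicode version>")
        lines.append(f"U+{ord(ch):04X}\t{ch}\t{name}")
    return lines

sample = "ŋa ɓe, ŋa ɗo!"  # hypothetical sample text for illustration only
inventory = character_inventory(sample)
print("\n".join(inventory))
```

Saving the printed lines to a UTF-8 text file would give a plain text version of the list that can be posted and circulated for comment.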
Questions on which Unicode characters to use?
* Check the code charts on the Unicode website
* Check the names list and annotations
* Not in the Unicode charts? See if it is on the “Pipeline” page on the website for new characters
* Unsure? Ask on the Unicode email list
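As a first pass before consulting the code charts, a short script can flag code points that the local Unicode data does not know as regular assigned characters. This is only a sketch: it reflects the Unicode version bundled with the Python installation (which may lag behind the current standard), and the function name `needs_follow_up` is an assumption made for this example.

```python
import unicodedata

def needs_follow_up(text):
    """Return code points that this Python's Unicode data treats as unassigned or private use."""
    flagged = []
    for ch in sorted(set(text)):
        category = unicodedata.category(ch)
        if category in ("Cn", "Co"):  # Cn = unassigned, Co = private use
            flagged.append((f"U+{ord(ch):04X}", category))
    return flagged

print("Unicode data version:", unicodedata.unidata_version)
sample = "a\u014b\ue000"  # U+E000 is a Private Use Area code point
print(needs_follow_up(sample))
```

Anything flagged this way is a candidate for checking against the charts, the “Pipeline” page, or the email list; the script does not replace those checks.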
Propose any missing characters for inclusion into the Unicode Standard
* TIP: Apply for funding to write a Unicode proposal or to conduct research
* TIP: Allow enough time for writing and review of the proposal
* Note: Once written, the proposal will take 2-5 years to get through the standards bodies
For languages without an orthography, consult Unicode Technical Note #19:
http://www.unicode.org/notes/tn19/
From Unicode Technical Note #19:
If at all possible, use an already encoded character, abiding by the following tips:
* If the script is right-to-left, select a character that is from a script that is right-to-left
* Avoid “presentation forms” or “letterlike characters”
* For a punctuation mark, select a character from the General Punctuation block
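The tips from the Technical Note can be partly checked by script. The sketch below is one possible reading of them, not something the Note itself provides: it treats bidirectional classes `R` and `AL` as right-to-left, and it tests block membership with a short, non-exhaustive table of block ranges. The function name `tn19_check` and the choice of blocks in `BLOCKS_TO_AVOID` are assumptions made for this example.

```python
import unicodedata

# A few blocks the tips warn about (not an exhaustive list)
BLOCKS_TO_AVOID = {
    "Letterlike Symbols": (0x2100, 0x214F),
    "Alphabetic Presentation Forms": (0xFB00, 0xFB4F),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}
GENERAL_PUNCTUATION = (0x2000, 0x206F)

def tn19_check(ch):
    """Report properties of one candidate character that bear on the tips above."""
    cp = ord(ch)
    avoid_block = None
    for block_name, (start, end) in BLOCKS_TO_AVOID.items():
        if start <= cp <= end:
            avoid_block = block_name
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "right_to_left": unicodedata.bidirectional(ch) in ("R", "AL"),
        "avoid_block": avoid_block,
        "in_general_punctuation": GENERAL_PUNCTUATION[0] <= cp <= GENERAL_PUNCTUATION[1],
    }

for candidate in "\u0627\ufe8d\u2113\u2020":
    print(tn19_check(candidate))
```

A candidate with a non-empty `avoid_block` is one to reconsider; the script only narrows the search and cannot judge whether a character is a good fit for the orthography.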
Step 2: Send locale data to CLDR project
Locales: local conventions used to create software that is tailored to a specific language and location
* Currency ($, £, etc.)
* Time/date formats, measurement systems, number formats (e.g., France: 902 300, Germany: 902.300, U.S.: 902,300)
* Sorting order
Common Locale Data Repository (CLDR): project hosted by Unicode that makes locale info freely available for software developers and others.
http://www.unicode.org/cldr/
TIP: Involve a member of the user community to submit locale data
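To show what locale data does in software, the number formats listed under Step 2 can be reproduced with a small lookup table. This is a simplified sketch: the `GROUPING_SEPARATOR` table and the `format_integer` function are hypothetical stand-ins for real locale data, and actual locale data may specify a no-break space rather than the plain space used here.

```python
# Grouping separators copied from the Step 2 examples; a hypothetical
# stand-in for the locale data that a project like CLDR makes available.
GROUPING_SEPARATOR = {"France": " ", "Germany": ".", "U.S.": ","}

def format_integer(value, region):
    """Format an integer with the thousands separator used in the given region."""
    grouped = f"{value:,}"  # Python's format spec groups digits with commas
    return grouped.replace(",", GROUPING_SEPARATOR[region])

for region in GROUPING_SEPARATOR:
    print(region, format_integer(902300, region))
```

Software without an entry for a given language has nothing to look up, which is why submitting locale data matters.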
Step 3: Create a font
* Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font
* If any characters are being proposed, wait until they are far along in the standards process
* Tip: Apply for funding to create a freely available font; costs can run $100/glyph
* It is recommended to use someone familiar with the script and computer typography (esp. for complex scripts)
* Use FontLab
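For a funding application, the $100/glyph figure gives a rough budget estimate. The glyph count below is a made-up number for illustration, and the function name `estimate_font_cost` is an assumption; note that for complex scripts the glyph count can exceed the character count, since ligatures and contextual forms need their own glyphs.

```python
COST_PER_GLYPH_USD = 100  # the "can run $100/glyph" figure quoted above

def estimate_font_cost(glyph_count, cost_per_glyph=COST_PER_GLYPH_USD):
    """Rough font budget: number of glyphs times cost per glyph."""
    return glyph_count * cost_per_glyph

hypothetical_glyph_count = 250  # placeholder value, not from any real font
print(estimate_font_cost(hypothetical_glyph_count))
```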
Step 4: Rendering engines for complex scripts need upgrades
* For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to properly draw the glyphs.
* Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly
* SIL’s Graphite rendering engine offers a good test environment
* Generally Apple does not require upgrades to its rendering engine
* Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to MS
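Whether a character set is likely to raise rendering issues can be estimated from character properties. The sketch below looks for two signals only, right-to-left characters and combining marks; it cannot detect ligature requirements, and the function name `rendering_hints` is an assumption made for this example.

```python
import unicodedata

def rendering_hints(characters):
    """Flag two properties that often call for rendering engine support."""
    has_rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in characters)
    has_marks = any(unicodedata.category(ch) in ("Mn", "Mc") for ch in characters)
    return {"right_to_left": has_rtl, "combining_marks": has_marks}

nko_sample = "\u07ca\u07eb"  # NKO LETTER A followed by an N'Ko combining tone mark
print(rendering_hints(nko_sample))
print(rendering_hints("abc"))
```

A result with either flag set is a reason to contact rendering engine developers early; a result with neither flag set does not guarantee that rendering will be trouble-free.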
Step 5: Create a keyboard
There are a number of keyboard creation programs that are available, including:
* Keyman (for Windows)
* Microsoft Keyboard Layout Creator (“MSKLC”)
* Ukelele (for the Mac)
* Keyboard Mapping for Linux
Make the keyboard layout practical and have the user community test it out.
Make the keyboard layout freely available (such as on Tavultesoft’s website)
Conclusion
Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant. Patience and persistence are key.
Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts)
Raising funding to cover all parts of the process from encoding to fonts is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.