A Field Linguist’s Guide to Unicode

Deborah Anderson

Script Encoding Initiative (Universal Scripts Project)

Dept. of Linguistics, UC Berkeley

LSA Panel: A Field Linguist’s Guide to Making Long-Lasting Texts and Databases

January 4, 2007

Working with Text Representation

“Use Unicode” (ISO/IEC 10646)

Practical issues to consider:

* which Unicode characters?

* what about fonts?

* how about keyboards?

* will the language be supported in off-the-shelf software?

The goal today is to discuss the whole process of enabling a language to be used on a computer:

* identifying letters/symbols in Unicode

* fonts

* keyboards

* how to get support for the characters and scripts in software

Step 1. Identify the characters used in a language

List all letters, symbols, digits, and marks of punctuation used in a language


One proposal for the Kazym Khanty alphabet

Assign Unicode codepoints

http://www.tlg.uci.edu/quickbeta.pdf

* Post a plain text version on a publicly accessible website

* Circulate this list for comment
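The listing step can be partly automated. Below is a minimal sketch, assuming Python 3 and its standard `unicodedata` module; the function name `character_inventory` and the sample string are illustrative only and do not come from the talk. It prints one line per distinct character (codepoint, Unicode name, the character itself), which could serve as a starting point for the plain text list to be posted and circulated.

```python
import unicodedata

def character_inventory(text):
    """Return one tab-separated line per distinct non-space character."""
    lines = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue  # spaces and line breaks are not part of the inventory
        # Fall back to a placeholder if this Python's Unicode data lacks a name
        name = unicodedata.name(ch, "<no name in this Unicode database>")
        lines.append(f"U+{ord(ch):04X}\t{name}\t{ch}")
    return lines

# Hypothetical sample text; in practice, read in a representative corpus
sample = "ŋa ɛa"
for line in character_inventory(sample):
    print(line)
```

The names shown depend on the Unicode version bundled with the Python installation, so the list should still be checked against the current code charts.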

Questions on which Unicode characters to use?

* Check codecharts on the Unicode website

* Check nameslist and annotations

* Not in Unicode charts? See if it is on the “Pipeline” page on the website for new characters

http://www.unicode.org/alloc/Pipeline.html

* Unsure? Ask on Unicode email list
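As a quick first check before consulting the charts, one can ask a local Unicode database whether a codepoint is assigned at all. The sketch below assumes Python 3's standard `unicodedata` module; `is_assigned` is a hypothetical helper name. It only reflects the Unicode version shipped with that Python, so it does not replace the code charts or the Pipeline page.

```python
import unicodedata

def is_assigned(codepoint):
    """True if the local Unicode database has a character at this codepoint.

    General category "Cn" means unassigned (or a noncharacter).
    """
    return unicodedata.category(chr(codepoint)) != "Cn"

print("Unicode version of this Python:", unicodedata.unidata_version)
for cp in (0x014B, 0x0378):
    status = "assigned" if is_assigned(cp) else "not assigned in this version"
    print(f"U+{cp:04X}: {status}")
```

A codepoint reported as unassigned may still be in the pipeline for a future version, which is why the Pipeline page and the email list remain the authoritative places to ask.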

Propose any missing characters for inclusion into the Unicode Standard

* TIP: Apply for funding to write a Unicode proposal or to conduct research

* TIP: Allow enough time for writing and review of proposal

* Note: Once written, the proposal will take 2-5 years to get through standards bodies

For languages without an orthography, consult Unicode Technical Note #19:

http://www.unicode.org/notes/tn19/

From Unicode Technical Note #19:

If at all possible, use an already encoded character, abiding by the following tips:

* If the script is right-to-left, select a character that is from a script that is right-to-left

* Avoid “presentation forms” or “letterlike characters”

* For a punctuation mark, select a character from the general punctuation block.
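Two of these tips can be roughly screened with character properties. The sketch below is a heuristic only, not something given in the Technical Note: it assumes Python 3's `unicodedata` module, and `tn19_flags` is a hypothetical helper. It treats a compatibility decomposition (one that starts with a tag such as `<compat>` or `<isolated>`) as a sign of a presentation form or letterlike character, and reports bidirectional classes R and AL as right-to-left. Not every character to avoid carries such a decomposition, so the charts and annotations still need to be read.

```python
import unicodedata

def tn19_flags(ch):
    """Return heuristic warnings/notes for one candidate character."""
    flags = []
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        # e.g. "<compat> 0066 0069" -> tag "<compat>"
        tag = decomp.split(">")[0] + ">"
        flags.append("compatibility character " + tag)
    if unicodedata.bidirectional(ch) in ("R", "AL"):
        flags.append("right-to-left")
    return flags

# U+FB01 is a ligature presentation form; U+05D0 is a Hebrew letter
for ch in ("\uFB01", "\u05D0", "a"):
    print(f"U+{ord(ch):04X}", tn19_flags(ch))
```

A candidate flagged as a compatibility character is usually one to avoid; a candidate for a right-to-left orthography should carry the right-to-left note.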


Step 2: Send locale data to CLDR project

Locales: local conventions used to create software that is tailored to a specific language and location

* Currency ($, £, etc.)

* Time/date formats, measurement systems, number formats (e.g., France: 902 300, Germany: 902.300, U.S.: 902,300)

* Sorting order
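To make "locale data" concrete, here is a small sketch in Python of how software might apply two such conventions: digit grouping and sorting order. Everything in it is an assumption for illustration: the separator table is hand-written from the three examples above (real software would read CLDR data), and the alphabet placing "ŋ" after "n" is hypothetical.

```python
# Hand-written table based on the examples above; not taken from CLDR files
GROUP_SEPARATORS = {"fr": " ", "de": ".", "en_US": ","}

def format_integer(n, locale):
    """Group digits in threes using the locale's separator."""
    return f"{n:,}".replace(",", GROUP_SEPARATORS[locale])

def make_sort_key(alphabet):
    """Build a sort key that follows the given alphabet order.

    Characters outside the alphabet sort last. Multi-letter units
    (digraphs) are not handled in this sketch.
    """
    rank = {letter: i for i, letter in enumerate(alphabet)}
    return lambda word: [rank.get(ch, len(alphabet)) for ch in word]

print(format_integer(902300, "fr"))     # 902 300
print(format_integer(902300, "de"))     # 902.300
print(format_integer(902300, "en_US"))  # 902,300

# Hypothetical alphabet in which "ŋ" follows "n"
key = make_sort_key("abcdefghijklmnŋopqrstuvwxyz")
print(sorted(["ŋa", "na", "oa"], key=key))  # ['na', 'ŋa', 'oa']
```

Without the tailored key, a plain codepoint sort would place "ŋa" after "oa", which is the kind of mismatch that submitted locale data is meant to prevent.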

Common Locale Data Repository (CLDR): project hosted by Unicode that makes locale info freely available for software developers and others.

http://www.unicode.org/cldr/

TIP: Involve a member of the user community to submit locale data

Step 3: Create a font

* Once a list of all the letters and symbols has been created with Unicode values, work can begin on a font

* If any characters are being proposed, wait until they are far along in the standards process

* Tip: Apply for funding to create a freely available font; costs can run $100/glyph

* Work with someone familiar with both the script and computer typography (esp. for complex scripts)

* Use FontLab

Step 4: Rendering Engines for complex scripts need upgrade

* For new complex scripts (e.g., bidi issues, complex ligatures), upgrades to the rendering engine are often needed in order to properly draw the glyphs.

* Early contact with companies (Microsoft and Adobe), the Linux community, and SIL is advised so the rendering engine can support the script properly

Examples of Complex Scripts

N’Ko

Javanese

* SIL’s Graphite rendering engine offers a good test environment

* Generally Apple does not require upgrades to its rendering engine

* Microsoft prioritizes which scripts are included in its next rendering engine; governmental support is helpful in making a case to MS

Step 5: Create a Keyboard

There are a number of keyboard creation programs that are available, including:

* Keyman (for Windows)

* Microsoft Keyboard Layout Creator (“MKLC”)

* Ukelele (for the Mac)

* Keyboard Mapping for Linux

* Make the keyboard layout practical and have the user community test it out.

* Make the keyboard layout freely available online (such as on Tavultesoft’s website)
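One simple test before handing a layout to the user community is to confirm that every character in the Step 1 inventory can actually be typed. The sketch below assumes the layout can be written down as a table from key sequences to output characters; this table is a hypothetical format for illustration, not the file format of Keyman, MKLC, or Ukelele, and the sample inventory is invented.

```python
# Hypothetical layout: key sequence typed -> character produced
LAYOUT = {
    "n": "n",
    ";n": "ŋ",   # ";" used here as an imagined dead key
    "e": "e",
    ";e": "ɛ",
}

def unreachable(inventory, layout):
    """Return inventory characters that no key sequence produces."""
    produced = set(layout.values())
    return sorted(ch for ch in set(inventory) if ch not in produced)

# Hypothetical inventory from Step 1
missing = unreachable("nŋeɛɔ", LAYOUT)
print("Characters with no key sequence:", missing)
```

An empty result only shows coverage; whether the key sequences are comfortable to type is still something the user community has to judge.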

Conclusion

Getting support for a language on the computer can be a long process, especially for new complex scripts, but the payoff is significant. Patience and persistence are key.

Avoid promising immediate access to a given language on the computer (unless all the characters are already encoded and available in widely used fonts)

Raising funding to cover all parts of the process from encoding to fonts is still an issue: Balinese needs fonts, N’Ko needs rendering engine support.

Unicode website: http://www.unicode.org

Script Encoding Initiative: http://linguistics.berkeley.edu/sei