
The Script Encoding Initiative

E-MELD, August 4, 2002

Deborah Anderson, Dept. of Linguistics, UC Berkeley

Levels of Representation

BASE FLOOR: Character Encoding

Example: A with a ring above = hex C5
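The example can be checked directly. The slide does not say which encoding it has in mind; the sketch below assumes the Unicode code point U+00C5, which Latin-1 (ISO 8859-1) stores as the single byte C5, and shows for contrast that UTF-8 stores the same character as two bytes.

```python
# LATIN CAPITAL LETTER A WITH RING ABOVE, written as an escape so the
# code point is unambiguous.
a_ring = "\u00c5"

# The abstract code point is hex C5 ...
code_point_hex = format(ord(a_ring), "X")

# ... and Latin-1 stores it as that single byte (assumed encoding; the
# slide only says "hex C5").
latin1_bytes = a_ring.encode("latin-1")

# UTF-8 uses a different byte sequence for the same character.
utf8_bytes = a_ring.encode("utf-8")

print(code_point_hex, latin1_bytes.hex(), utf8_bytes.hex())
```

The same character therefore has one number at the character level but different byte sequences depending on the encoding form used to store it.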

Levels of Representation

UPPER LEVEL: Higher level markup: HTML, XML, TEI, other tagsets

BASE FLOOR: Character Encoding

Problem

Euripides' Alcestis:

Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
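One common way such text turns into unreadable characters can be reproduced in a few lines. This is a minimal sketch that assumes the bytes are UTF-8 and that the reader misinterprets them as Windows-1252; other mismatches between encodings produce different but equally unreadable results.

```python
# First word of the quoted line, written with escapes so the exact code
# points are unambiguous: U+1F0C, delta, mu, eta, theta, apostrophe.
word = "\u1f0c\u03b4\u03bc\u03b7\u03b8'"

# Correct storage: polytonic Greek encoded as UTF-8.
utf8_bytes = word.encode("utf-8")

# The assumed mistake: the same bytes interpreted as Windows-1252.
garbled = utf8_bytes.decode("cp1252")

# Undoing the wrong step recovers the text, provided no byte was lost.
recovered = garbled.encode("cp1252").decode("utf-8")

print(len(word), len(garbled), recovered == word)
```

Six characters become twelve unrelated Latin letters and symbols, because each Greek letter occupies two or three bytes in UTF-8 and each byte is shown as a separate Windows-1252 character.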

Background to the Problem

Different countries and/or vendors had their own character encoding systems

Interoperability was low or non-existent

Fonts were created using non-standard (ad hoc) character encodings

Unicode (www.unicode.org)

Fully synchronized with ISO 10646

Unicode aims to be universal.

Unicode 3.2 now has roughly 95,200 encoded characters.

Code points above the first “plane” (the Basic Multilingual Plane) are encoded in UTF-16 with pairs of 16-bit units. This “surrogate” technique allows over 1 million characters to be encoded.
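The surrogate arithmetic is short enough to sketch. The function name below is illustrative, and U+10400 is just an arbitrary code point above the Basic Multilingual Plane chosen for the demonstration.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10400)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x10400).encode("utf-16-be")

# 1,024 high surrogates x 1,024 low surrogates.
SUPPLEMENTARY_CODE_POINTS = 0x400 * 0x400

print(hex(high), hex(low), encoded.hex(), SUPPLEMENTARY_CODE_POINTS)
```

The product of the two surrogate ranges is 1,048,576, which is where the "over 1 million" figure comes from.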

Character Proposal Process

Must pass two standards bodies:

1. Unicode Technical Committee

2. ISO WG2 (ISO/IEC JTC1/SC2/WG2)

Proposals require close review

Needs letters of support from the user community (scholars/modern speakers)

Process from first proposal until final passage takes approx. 3-5 years

Current Situation

52 scripts are now covered, but 90+ remain (primarily historic and minority scripts)

Work has been almost entirely voluntary to date

Outstanding script proposals will require substantial scholarly input and contact with modern speakers (for minority scripts)

Less commercial interest in the remaining scripts

Scripts Missing from Unicode (http://www.unicode.org/sei/alpha-script-list.html)

MINORITY SCRIPTS: New Tai Lue, Lepcha, N'ko, Ol Cemet', Meithei/Manipuri, Pahawh Hmong, Cham, Saurashtra, Lanna, Tifinagh, Chakma

HISTORIC SCRIPTS: Egyptian hieroglyphs, Sumero-Akkadian cuneiform, Phoenician, Carian, Lycian, Luwian, Aztec pictographs, Mayan hieroglyphs, Avestan, Old Persian cuneiform

Sample Unicode Proposal

Proposals contain:

background to the script for the general user and implementer

a description of the characters’ properties

a sample from running texts

an inventory of the characters of a script: a graphic representation (a “glyph”) and a name

a list of recent reference works
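To give a sense of what a character's name and properties look like once a proposal has passed, the sketch below queries Python's bundled copy of the Unicode Character Database for an already-encoded character. It only illustrates the kind of data a proposal has to supply; it is not part of the proposal process itself.

```python
import unicodedata

# An already-encoded character: A with a ring above.
ch = "\u00c5"

# The formal character name assigned in the standard.
name = unicodedata.name(ch)

# Two of its properties: general category and canonical decomposition.
category = unicodedata.category(ch)            # "Lu" = uppercase letter
decomposition = unicodedata.decomposition(ch)  # base letter + combining ring

print(f"U+{ord(ch):04X}", name, category, decomposition)
```

A proposal for an unencoded script has to provide this kind of information for every character in its inventory.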

Solution: Script Encoding Initiative

Started the SEI at UC Berkeley in April 2002, in conjunction with the Unicode VP

Will fund Unicode proposal authors and font creators

Proposals are screened by the Unicode VP

Brings the university into the international character encoding standards process

Helps promote Unicode, which will assure longevity and stability for linguistic data

Results to Date

$5,000 seed-funding from an anonymous donor

No corporate funding forthcoming; still in need of stable funding base

Have already set priorities for scripts, ranking minority scripts higher

We have created a preliminary website listing the scripts, so experts can look at the list and see the current status of proposals.

Impact of Project

Allow minority populations to access their own script for communication with others and the world

Permit online courseware in these scripts

Allow academic scholars to use the script in their research and online publications

Provide assistance for linguists working on those languages currently without a script

Help build a storehouse of information on the world’s scripts

Needs

Funding: Applied for NEH funding, but this requires raising matching funds

Participation from linguists to answer questions and identify problems/missing characters

Recommendation: linguists might want to include a line item for Unicode proposals in their government grant proposals

How to Help

Contact Script Encoding Initiative: [email protected]

Website: www.linguistics.berkeley.edu/~dwanders

Unicode website: www.unicode.org

Problem: How to handle characters now outside Unicode

Results of a TEI Working Group meeting:

1. use Private Use Area

2. use entities

Other alternatives: Continue to use transcription/transliteration schemes (with ASCII) or non-standard encodings
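The two TEI options can be sketched side by side. Everything in the mapping below is hypothetical: the sign names and their Private Use Area assignments are invented for illustration and would have to be agreed privately among the people exchanging the data, since the standard gives PUA code points no meaning.

```python
import unicodedata

# Hypothetical project-local assignments (not from any standard).
PUA_ASSIGNMENTS = {
    "hypothetical-sign-1": 0xE000,
    "hypothetical-sign-2": 0xE001,
}


def is_private_use(code_point: int) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(chr(code_point)) == "Co"


def as_pua_text(sign_names):
    """Option 1: store each sign as its agreed Private Use Area code point."""
    return "".join(chr(PUA_ASSIGNMENTS[name]) for name in sign_names)


def as_entities(sign_names):
    """Option 2: store each sign as a named entity reference in the markup."""
    return "".join(f"&{name};" for name in sign_names)


signs = ["hypothetical-sign-1", "hypothetical-sign-2"]
pua_text = as_pua_text(signs)
entity_text = as_entities(signs)

print([hex(ord(c)) for c in pua_text], entity_text)
```

The PUA form stays at the character level but is only interpretable with the private agreement and a matching font; the entity form stays readable in plain ASCII but depends on the markup layer to resolve each name.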
