The Script Encoding Initiative
E-MELD, August 4, 2002
Deborah Anderson, Dept. of Linguistics, UC Berkeley
Levels of Representation
BASE FLOOR: Character Encoding
Example: A with a ring above = hex C5
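As a small illustration of what character encoding means at this base level, the sketch below (Python, chosen here only for illustration and not part of the original talk) shows that "A with a ring above" is code point U+00C5, which is the single byte hex C5 in ISO 8859-1 (Latin-1) but two bytes in UTF-8.

```python
# Illustrative sketch: the character from the slide and two of its encoded forms.
ch = "\u00C5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

code_point = ord(ch)                 # number assigned to the character
latin1_bytes = ch.encode("latin-1")  # one byte in ISO 8859-1
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8

print(f"U+{code_point:04X}")         # U+00C5
print(latin1_bytes.hex().upper())    # C5
print(utf8_bytes.hex().upper())      # C385
```

The same abstract character can thus have different byte sequences depending on the encoding, which is why the encoding has to be agreed on before any higher-level markup can be interpreted.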
Levels of Representation
UPPER LEVEL: Higher level markup: HTML, XML, TEI, other tagsets
BASE FLOOR: Character Encoding
Problem
Euripides’ Alcestis:
Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
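A minimal sketch of how this kind of display problem typically arises: text stored as UTF-8 is read by software that assumes a single-byte encoding. The Greek string is the line from the slide as it should read; the choice of Latin-1 as the mistaken encoding is an assumption made for illustration.

```python
# Illustrative sketch: correctly encoded Greek turns into gibberish when its
# UTF-8 bytes are interpreted under the wrong (single-byte) encoding.
line = "Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,"

utf8_bytes = line.encode("utf-8")

# Latin-1 maps every byte to exactly one character, so each Greek letter
# (two or three bytes in UTF-8) becomes two or three unrelated symbols.
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# The bytes themselves are intact: undoing the wrong step restores the text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == line)
```

The damage is in the interpretation, not in the data, which is why an agreed character encoding matters more than any particular font.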
Background to the Problem
Different countries and/or vendors had their own character encoding systems
Interoperability was low or non-existent
Fonts were created using non-standard (ad hoc) character encodings
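To make the interoperability problem concrete, the sketch below decodes one and the same byte value under several legacy encodings. The codec names are Python's, and the selection of encodings is only a sample chosen for illustration.

```python
# Illustrative sketch: one byte value, several legacy interpretations.
byte = bytes([0xC5])

# A sample of pre-Unicode encodings from different vendors and regions.
legacy_encodings = ["latin-1", "cp437", "mac_roman", "iso8859_7", "koi8_r"]

for name in legacy_encodings:
    # Each encoding assigns a different character to the same byte.
    print(f"{name:10} 0xC5 -> {byte.decode(name)}")
```

Without knowing which encoding produced the byte, a receiving system has no reliable way to tell which character was meant.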
Unicode (www.unicode.org)
Fully synchronized with ISO 10646
Unicode aims to be universal.
Unicode 3.2 now has nearly 95,200 encoded characters.
Code points above the first “plane” (the Basic Multilingual Plane) are encoded in UTF-16 with pairs of 16-bit units. This “surrogate” technique allows over 1 million characters to be encoded.
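The surrogate arithmetic can be sketched as follows. The function name is hypothetical, and U+10330 (GOTHIC LETTER AHSA) is used only as a sample code point above the Basic Multilingual Plane.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above the BMP into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not above the Basic Multilingual Plane")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x10330)
print(f"{high:04X} {low:04X}")           # D800 DF30

# Cross-check against Python's own UTF-16 encoder.
assert chr(0x10330).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")

# 1,024 high surrogates x 1,024 low surrogates give the "over 1 million" figure.
print(0x400 * 0x400)                     # 1048576
```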
Character Proposal Process
Must pass two standards bodies:
1. Unicode Technical Committee
2. ISO WG2 (ISO/IEC JTC1/SC2/WG2)
Proposals require close review
Needs letters of support from the user community (scholars/modern speakers)
Process from first proposal until final passage takes approx. 3-5 years
Current Situation
52 scripts are now covered, but 90+ remain (primarily historic and minority scripts)
Work has been almost entirely voluntary to date
Outstanding script proposals will require substantial scholarly input and contact with modern speakers (for minority scripts)
Less commercial interest in the remaining scripts
Scripts Missing from Unicode (http://www.unicode.org/sei/alpha-script-list.html)
MINORITY SCRIPTS: New Tai Lue, Lepcha, N'ko, Ol Cemet', Meithei/Manipuri, Pahawh Hmong, Cham, Saurashtra, Lanna, Tifinagh, Chakma
HISTORIC SCRIPTS: Egyptian hieroglyphs, Sumero-Akkadian cuneiform, Phoenician, Carian, Lycian, Luwian, Aztec pictographs, Mayan hieroglyphs, Avestan, Old Persian cuneiform
Sample Unicode Proposal
Proposals contain:
background to the script for the general user and implementer
a description of the characters’ properties
sample from running texts
inventory of the characters of a script: a graphic representation (a “glyph”) and a name
a list of recent reference works.
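For a sense of what a character's "name" and "properties" look like once a proposal has been accepted, the sketch below queries Python's bundled copy of the Unicode Character Database for an already encoded character. It illustrates the kind of information a proposal has to supply and is not part of the proposal process itself.

```python
import unicodedata

# An already encoded character, used here only as a sample.
ch = "\u00C5"

print(unicodedata.name(ch))           # LATIN CAPITAL LETTER A WITH RING ABOVE
print(unicodedata.category(ch))       # Lu  (uppercase letter)
print(unicodedata.bidirectional(ch))  # L   (left-to-right)
print(unicodedata.decomposition(ch))  # 0041 030A  (A + combining ring above)
```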
Solution: Script Encoding Initiative
Started the SEI at UC Berkeley in April 2002, in conjunction with the Unicode VP
Will fund Unicode proposal authors and font creators
Proposals are screened by the Unicode VP
Brings the university into the international character encoding standards process
Helps promote Unicode, which will assure longevity and stability for linguistic data
Results to Date
$5,000 seed-funding from an anonymous donor
No corporate funding forthcoming; still in need of stable funding base
Have already set priorities for scripts, ranking minority scripts higher
We have created a preliminary website listing the scripts, so experts can look at the list and see the current status of proposals.
Impact of Project
Allow minority populations to access their own script for communication with others and the world
Permit online courseware in these scripts
Allow academic scholars to use the script in their research and online publications
Provide assistance for linguists working on those languages currently without a script
Help build a storehouse of information on the world’s scripts
Needs
Funding: applied for an NEH grant, but this requires raising matching funds
Participation from linguists to answer questions and identify problems/missing characters
Recommendation: linguists may want to include a line item for Unicode proposals in their government grant proposals
How to Help
Contact Script Encoding Initiative: [email protected]
Website: www.linguistics.berkeley.edu/~dwanders
Unicode website: www.unicode.org
Problem: How to handle characters now outside Unicode
Results of a TEI Working Group meeting:
1. use Private Use Area
2. use entities
Other alternatives: Continue to use transcription/transliteration schemes (with ASCII) or non-standard encodings
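The two TEI options can be sketched roughly as follows. The function names, the entity name, and the choice of U+E000 as the private assignment are all hypothetical; in practice the assignments have to be agreed among the people exchanging the data, since Private Use code points have no standard meaning.

```python
import unicodedata

def in_private_use_area(ch: str) -> bool:
    """True if the character has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

def to_numeric_reference(ch: str) -> str:
    """Write a character as an XML numeric character reference."""
    return f"&#x{ord(ch):X};"

def entity_declaration(name: str, ch: str) -> str:
    """Build a DTD declaration binding a named entity to the character."""
    return f'<!ENTITY {name} "{to_numeric_reference(ch)}">'

# Hypothetical private assignment for a character not yet in Unicode.
pua_char = "\uE000"

print(in_private_use_area(pua_char))               # True
print(in_private_use_area("A"))                    # False
print(entity_declaration("sampleSign", pua_char))  # <!ENTITY sampleSign "&#xE000;">
```

Using a named entity in the text keeps the document readable and lets the mapping be changed in one place once the character receives a standard code point.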