[dcsb] torsten roeder (bbaw), yury arzhanov (ruhruniversität bochum) "the glossarium...
The Glossarium Graeco-Arabicum: Linguistic Research and Database Design in Polyalphabetic Environments
Torsten Roeder (BBAW), Yury Arzhanov (Ruhr Universität Bochum)
Ms. Paris BnF 5847, f. 5: Muslim scholars in discussion.
Arabic translation of Dioscurides’ Materia medica (Ibn al-Bayṭār, al-Ǧāmiʿ li-mufradāt al-adwiya wa-l-aghdhiya, ǧuzʾ 1–4, Būlāq 1291 H.)
Filecards for the Greek and Arabic Lexicon (GALex)
GALex
The Database Glossarium Græco-Arabicum
The Glossarium Graeco-Arabicum makes available information in the following fields of research:
• the vocabulary and syntax of Classical and Middle Arabic;
• the development of a scientific and technical vocabulary in Arabic;
• the vocabulary of Classical and Middle Greek;
• the chronology and nature of the translation movement into Arabic;
• the establishment of the texts of Greek works and their Arabic translations.
The Glossarium Graeco-Arabicum online:
November 2013
Glossarium Graeco-Arabicum | Telota
BERLIN-BRANDENBURGISCHE AKADEMIE DER WISSENSCHAFTEN
I Technical Challenges → polyalphabetic environment
II Scholarly Requirements → linguistic database
III Technical vs. Scholarly → concluding discussion
OUTLINE
1 Languages Used in the GlossGA Interface
2 Unicode Character Corpus
3 Areas of Technical Challenges
4 Examples
I. TECHNICAL CHALLENGES
Languages used within the project:
Ancient Greek | Medieval Arabic | Modern English
Greek alphabet | Arabic alphabet | Latin alphabet
3 layers of diacritics | optional vowel signs | 1 layer of diacritics
LTR (left to right) | RTL (right to left) | LTR (left to right)
I.1. LANGUAGES
Unicode Chart | Range | Description
C0 Controls and Basic Latin | 0000-007F | Latin Alphabet
Latin Extended-A | 0100-017F | transliteration symbols
Latin Extended-Additional | 1E00-1EFF | transliteration symbols
Greek and Coptic | 0370-03FF | Greek Alphabet
Greek Extended | 1F00-1FFF | Greek Diacritics
Arabic | 0600-06FF | Arabic Alphabet
Arabic Supplement | 0750-077F | Arabic Alphabet
Spacing Modifier Letters | 02B0-02FF | special Arabic characters
→ in total: about 450 different characters from eight different charts
I.2. UNICODE
Requirements:
1. Data input in all three alphabets with all vowels and diacritics → How to implement a comfortable interface?
2. Simultaneous display of texts in three alphabets and two directions → How to implement concurrent writing directions?
3. Search for terms, insensitive to diacritics or vowels → How to implement queries with different collation sets?
I.3. REQUIREMENTS
a Data Input
b Writing Directions
c Search
d Search Terms
I.4. EXAMPLES
ʾ ˒ ʿ ˓
I.4.a. DATA INPUT
U+02BE MODIFIER LETTER RIGHT HALF RING
transliteration of Arabic hamza
U+02D2 MODIFIER LETTER CENTRED RIGHT HALF RING
more rounded articulation
U+02BF MODIFIER LETTER LEFT HALF RING
transliteration of Arabic ain
U+02D3 MODIFIER LETTER CENTRED LEFT HALF RING
less rounded articulation
I.4.a. DATA INPUT
[ʾ]
[˒]
[ʿ]
[˓]
Problem: Appearance vs. Encoding
Users will normally choose characters …
→ not because of their Unicode description
→ but because of their appearance
How to bring Unicode to the user?
I.4.a. DATA INPUT
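The gap between appearance and encoding becomes visible as soon as the code points and official names are printed next to the glyphs. A minimal Python sketch using only the standard `unicodedata` module; the helper name `describe` is an assumption made for illustration and is not part of the project:

```python
import unicodedata

def describe(ch):
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Four characters that look nearly identical on screen
# but are encoded differently.
for ch in "\u02be\u02d2\u02bf\u02d3":
    print(ch, describe(ch))
```

A listing like this could back a tooltip in an input form, so that users see what they are actually entering.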
Solutions:
– restrict the characters accepted by the database
→ safe, but requires validation methods
– provide a virtual keyboard (onscreen)
→ user-friendly
Alternative methods:
– beta code
→ less recommendable from a Unicode point of view
→ but widely used
I.4.a. DATA INPUT
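One possible validation method for the first solution is a range check before data is stored. The sketch below is in Python and assumes the eight chart ranges listed in section I.2; the names `PROJECT_RANGES` and `rejected_characters` are hypothetical. Since the project corpus has only about 450 characters, a real whitelist would probably enumerate individual characters rather than whole charts:

```python
# Code point ranges of the eight charts listed in section I.2.
PROJECT_RANGES = [
    (0x0000, 0x007F),  # C0 Controls and Basic Latin
    (0x0100, 0x017F),  # Latin Extended-A
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x1F00, 0x1FFF),  # Greek Extended
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x02B0, 0x02FF),  # Spacing Modifier Letters
]

def rejected_characters(text):
    """Return the characters of `text` that lie outside every project range."""
    return [
        ch for ch in text
        if not any(low <= ord(ch) <= high for low, high in PROJECT_RANGES)
    ]

# Latin, Greek and Arabic letters pass; a CJK character is reported.
print(rejected_characters("abc \u03b7 \u0635"))
print(rejected_characters("a\u4e2d"))
```

Note that a chart-level check still accepts look-alikes such as U+02D2, because they sit in the same chart as the intended U+02BE.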
Phenomenon:
Home (THEN) Arabic Glossary (THEN) ص (THEN) صحة
becomes
Home > Arabic Glossary > ص> صحة
I.4.b. WRITING DIRECTIONS
Problem: Strong vs. Weak Characters
In Unicode, alphabetic characters are usually
STRONG CHARACTERS
which determine the writing direction,
while punctuation characters are usually
WEAK CHARACTERS
which do not change the writing direction.
→ relevant in:
comma separated lists, bibliographic references,
breadcrumb lines, table alignments …
I.4.b. WRITING DIRECTIONS
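The directional class of any character can be looked up with Python's standard `unicodedata` module. The sketch below is illustrative and the helper name `is_strong` is an assumption. In the terminology of the Unicode Bidirectional Algorithm, the class of `>` (ON) counts as neutral and the class of the comma (CS) as weak; neither sets a direction on its own:

```python
import unicodedata

# Bidirectional classes that determine a writing direction.
STRONG = {"L", "R", "AL"}

def is_strong(ch):
    """True if the character has a strong bidirectional class (L, R or AL)."""
    return unicodedata.bidirectional(ch) in STRONG

# Latin letter, Arabic letter, breadcrumb separator, comma.
for ch in ["H", "\u0635", ">", ","]:
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), is_strong(ch))
```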
Solutions:
– insert a "strong whitespace": Unicode U+200E (left-to-right mark) or U+200F (right-to-left mark)
– or, if in HTML, set the writing direction directly: <span dir="ltr">…</span>
I.4.b. WRITING DIRECTIONS
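The first solution can be sketched for the breadcrumb line from the earlier example. This is a minimal Python illustration, not the project's actual code; the function name `breadcrumb` is hypothetical. Each separator is enclosed by two left-to-right marks, so it sits between strong left-to-right characters even when its neighbours are Arabic:

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK: invisible, strong left-to-right

def breadcrumb(parts, separator=">"):
    """Join breadcrumb parts so that every separator is enclosed by LRMs."""
    return f"{LRM} {separator} {LRM}".join(parts)

crumbs = breadcrumb(["Home", "Arabic Glossary", "\u0635", "\u0635\u062d\u0629"])
print(crumbs)
```

The marks are invisible, so the stored string differs from what the user sees; removing them again before searching or comparing is advisable.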
GREEK | ARABIC | ENGLISH
diacritics not distinct | vowel signs not distinct | diacritics distinct
requirement: η finds also ἠ ἦ ἥ | requirement: سبب finds also vocalized forms such as سَبَب | requirement: d does not find ḏ
Problem: Distinction vs. Collation
I.4.c. SEARCH
Solution:
Greek Arabic English
Greek collation Arabic collation Latin collation
Collation Charts: <http://unicode.org/charts/uca/>
Restrictions:
– does not work for mixed texts → data needs to be separated
– some environments do not support Arabic vowel collation → e.g. MySQL <6.0
I.4.c. SEARCH
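Where the database cannot supply a suitable collation, the three requirements can be approximated in application code with one search key per language. The sketch below is a substitute technique based on Unicode normalization, not the collation-based solution from the slide; all function names are assumptions:

```python
import unicodedata

# Arabic fathatan .. sukun (U+064B-U+0652): vowel signs, shadda and sukun.
ARABIC_SIGNS = {chr(cp) for cp in range(0x064B, 0x0653)}

def greek_key(text):
    """Drop Greek diacritics: decompose, remove combining marks, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", bare)

def arabic_key(text):
    """Drop the optional vowel signs but keep the consonantal skeleton."""
    return "".join(ch for ch in text if ch not in ARABIC_SIGNS)

def latin_key(text):
    """Keep transliteration diacritics distinct; only unify the encoding form."""
    return unicodedata.normalize("NFC", text)

print(greek_key("\u1f20 \u1f26 \u1f25"))               # three accented etas
print(arabic_key("\u0633\u064e\u0628\u064e\u0628"))    # vocalized s-b-b
print(latin_key("d") == latin_key("\u1e0f"))           # d vs. d with line below
```

Both the stored words and the query are passed through the key function of their language, and the keys are compared. As with collations, this requires that Greek, Arabic and Latin data are kept in separate fields.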
Phenomenon:
– user searches for Arabic words starting with مل
– truncation symbol (asterisk) appears on the wrong side
*مل
Problem: Neutral Writing Direction
– the standard asterisk is a NEUTRAL CHARACTER
– it adopts the main writing direction
I.4.d. SEARCH TERMS
Solution:
Unicode Arabic Asterisk (U+066D), right-to-left
مل٭
I.4.d. SEARCH TERMS
Challenges for the Developer:
– Unicode does not provide general truncation or joker symbols
– different asterisk and joker signs must be processed
– no standard solution available
I.4.d. SEARCH TERMS
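Processing several truncation signs could look roughly like the following Python sketch. It is one possible approach, not a standard solution; the set `TRUNCATION_SIGNS` and the function `parse_truncation` are hypothetical names. The term is analysed in logical (typed) order, so the side on which the sign is displayed does not matter:

```python
# Characters accepted as truncation symbols (an assumed, extensible set).
TRUNCATION_SIGNS = {
    "*",       # U+002A ASTERISK, neutral writing direction
    "\u066d",  # U+066D ARABIC FIVE POINTED STAR, right-to-left
}

def parse_truncation(term):
    """Split a search term into (stem, open_start, open_end).

    A sign typed after the stem means "starts with"; a sign typed
    before the stem means "ends with".
    """
    term = term.strip()
    open_start = term[:1] in TRUNCATION_SIGNS
    open_end = len(term) > 1 and term[-1] in TRUNCATION_SIGNS
    stem = term[1:] if open_start else term
    stem = stem[:-1] if open_end else stem
    return stem, open_start, open_end

# Both spellings of "words starting with m-l" give the same result.
print(parse_truncation("\u0645\u0644\u066d"))
print(parse_truncation("\u0645\u0644*"))
```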
Technical Recommendations for Polyalphabetic Environments
– use software components that support Unicode throughout
– compose a project corpus of Unicode characters
– provide input methods to make the characters easily available
– consider Unicode writing directions and collations
– make sure that all characters not only appear correctly, but are also encoded correctly
SUMMARY OF I.
1 Corpus → How to deal with a database of 70,000+ words?
2 Translation movements → How to visualize transformations of language structures?
3 Single Lexemes → How to transform the database into a dictionary?
II. SCHOLARLY REQUIREMENTS
How to deal with a database of 70,000+ words?
– search form
→ user needs to know exactly what he/she is looking for
– browsing (e.g. by sources and words in alphabetical order)
→ user needs to know roughly what he/she is looking for
– visualization
→ statistical and/or graphical approach
→ user can explore the corpus
II.1. CORPUS
II.1.a. CORPUS TREEMAP
Distribution of sources in the GlossGA corpus
Area size corresponds to number of words
→ Which sources constitute the major/minor parts of the corpus?
II.1.b. SOURCE TREEMAP
Distribution of words in one source
Area size corresponds to number of words
→ What kind of vocabulary constitutes the source?
How to visualize transformation of language structures?
→ compare parts of speech in diagrams (experimental)
II.2. TRANSLATION MOVEMENTS
Compared Parts of Speech
Blue:
Greek Parts of Speech
Red:
Arabic Parts of Speech
Bar Length:
number of words of the respective part of speech
II.2.a. TRANSLATION MOVEMENTS
Compared Parts of Speech
X-Axis:
Greek Parts of Speech
Y-Axis:
Arabic Parts of Speech
Intersections:
Dot size represents number of words transferred from Greek PoS into Arabic PoS
II.2.b. TRANSLATION MOVEMENTS
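The numbers behind such a dot diagram amount to a cross-tabulation of part-of-speech pairs. A minimal Python sketch follows; the pairs are toy data invented for illustration and `pos_matrix` is a hypothetical name, since the slides do not show how the project computes the diagram:

```python
from collections import Counter

# Toy (Greek PoS, Arabic PoS) pairs, invented for illustration only.
pairs = [
    ("noun", "noun"), ("noun", "noun"), ("verb", "verb"),
    ("verb", "noun"), ("adjective", "noun"), ("verb", "verb"),
]

def pos_matrix(pairs):
    """Count how often each Greek part of speech maps to each Arabic one."""
    return Counter(pairs)

matrix = pos_matrix(pairs)
for (greek_pos, arabic_pos), count in sorted(matrix.items()):
    print(f"{greek_pos} -> {arabic_pos}: {count}")
```

Each count would then set the dot size at the intersection of the Greek part of speech (x-axis) and the Arabic part of speech (y-axis).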
How to transform the database into a dictionary?
Experimental preview:
→ collation of all entries of a Greek lexeme
→ ordered by Arabic lexeme
→ output with source and context
II.3.a. SINGLE LEXEMES
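The three steps of the preview can be sketched as a grouping operation. The Python example below uses toy records invented for illustration; the field names and the function `dictionary_entry` are assumptions, not the project's schema. Sorting here uses plain code point order, which only approximates Arabic alphabetical order:

```python
from collections import defaultdict

# Toy records, invented for illustration; the field names are assumptions.
records = [
    {"greek": "λόγος", "arabic": "كلمة", "source": "Source B", "context": "..."},
    {"greek": "λόγος", "arabic": "قول", "source": "Source A", "context": "..."},
    {"greek": "λόγος", "arabic": "قول", "source": "Source C", "context": "..."},
    {"greek": "φύσις", "arabic": "طبيعة", "source": "Source A", "context": "..."},
]

def dictionary_entry(records, greek_lexeme):
    """Collect all records of one Greek lexeme, grouped by Arabic lexeme."""
    grouped = defaultdict(list)
    for rec in records:
        if rec["greek"] == greek_lexeme:
            grouped[rec["arabic"]].append((rec["source"], rec["context"]))
    # Order the groups by Arabic lexeme (code point order).
    return sorted(grouped.items())

entry = dictionary_entry(records, "λόγος")
for arabic, attestations in entry:
    print(arabic, attestations)
```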
Export function via email:
II.3.b. SINGLE LEXEMES
Recommendations
1 provide multiple access methods → support various user scenarios
2 invent statistical and visual evaluation methods → profit from electronic data processing
3 provide conventional scholarly formats → correspond to the community’s needs
SUMMARY OF II.
Situation: Technical vs. Scholarly Requirements
– which one goes first?
→ technical requirements as necessary basis
→ scholarly requirements as superior objective
– both need attention from scholars
– both need attention from techies
→ mutual understanding
→ team competence
LAST BUT ONE SLIDE
Thank you for your attention!
Project Website http://telota.bbaw.de/glossga
Contact Yury Arzhanov | [email protected] Torsten Roeder | [email protected]