The Glossarium Graeco-Arabicum. Linguistic Research and Database Design in Polyalphabetic Environments. Torsten Roeder (BBAW), Yury Arzhanov (Ruhr-Universität Bochum)

Upload: digital-classicist-seminar-berlin

Post on 05-Dec-2014


Page 1: [DCSB] Torsten Roeder (BBAW), Yury Arzhanov (Ruhr-Universität Bochum) "The Glossarium Graeco-Arabicum. Linguistic Research and Database Design in Polyalphabetic Environments"

The Glossarium Graeco-Arabicum. Linguistic Research and Database Design in Polyalphabetic Environments

Torsten Roeder (BBAW), Yury Arzhanov (Ruhr-Universität Bochum)


Ms. Paris BnF 5847, f. 5: Muslim scholars in discussion.


Arabic translation of Dioscurides’ Materia medica (Ibn al-Bayṭār, al-Ǧāmiʿ li-mufradāt al-adwiya wa-l-aghdhiya, ǧuzʾ 1–4. Būlāq 1291 H.)


Filecards for the Greek and Arabic Lexicon (GALex)


GALex


The Database Glossarium Græco-Arabicum


The Glossarium Graeco-Arabicum makes available information in the following fields of research:

• the vocabulary and syntax of Classical and Middle Arabic;

• the development of a scientific and technical vocabulary in Arabic;

• the vocabulary of Classical and Middle Greek;

• the chronology and nature of the translation movement into Arabic;

• the establishment of the texts of Greek works and their Arabic translations.


The Glossarium Graeco-Arabicum online:


November 2013

Glossarium Graeco-Arabicum

Telota

BERLIN-BRANDENBURGISCHE AKADEMIE DER WISSENSCHAFTEN


I Technical Challenges → polyalphabetic environment

II Scholarly Requirements → linguistic database

III Technical vs. Scholarly → concluding discussion

OUTLINE


1 Languages Used in the GlossGA Interface

2 Unicode Character Corpus

3 Areas of Technical Challenges

4 Examples

I. TECHNICAL CHALLENGES


Languages used within the project:

Language     Ancient Greek            Medieval Arabic        Modern English
Script       Greek alphabet           Arabic alphabet        Latin alphabet
Diacritics   3 layers of diacritics   optional vowel signs   1 layer of diacritics
Direction    LTR (left to right)      RTL (right to left)    LTR (left to right)

I.1. LANGUAGES


Unicode Chart                  Range       Description
C0 Controls and Basic Latin    0000-007F   Latin alphabet
Latin Extended-A               0100-017F   transliteration symbols
Latin Extended Additional      1E00-1EFF   transliteration symbols
Greek and Coptic               0370-03FF   Greek alphabet
Greek Extended                 1F00-1FFF   Greek diacritics
Arabic                         0600-06FF   Arabic alphabet
Arabic Supplement              0750-077F   Arabic alphabet
Spacing Modifier Letters       02B0-02FF   special Arabic characters

→ in total: about 450 different characters from eight different charts

I.2. UNICODE
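A minimal sketch of how such a project corpus could be checked in code, assuming Python and using the eight chart ranges listed on this slide. The function names `chart_of` and `rejected_characters` are hypothetical. Note that the slide speaks of about 450 selected characters, whereas this sketch accepts every code point inside the eight charts, so it is a coarser filter than the project's actual corpus.

```python
# Chart ranges copied from the slide: (chart name, first, last code point).
CORPUS_CHARTS = [
    ("C0 Controls and Basic Latin", 0x0000, 0x007F),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
    ("Greek and Coptic", 0x0370, 0x03FF),
    ("Greek Extended", 0x1F00, 0x1FFF),
    ("Arabic", 0x0600, 0x06FF),
    ("Arabic Supplement", 0x0750, 0x077F),
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),
]


def chart_of(char):
    """Return the name of the chart containing char, or None (hypothetical helper)."""
    code_point = ord(char)
    for name, first, last in CORPUS_CHARTS:
        if first <= code_point <= last:
            return name
    return None


def rejected_characters(text):
    """List the characters of text that fall outside all eight charts."""
    return [char for char in text if chart_of(char) is None]


if __name__ == "__main__":
    for sample in ["\u1f26", "\u0635", "\u02bf", "\u20ac"]:
        print(f"U+{ord(sample):04X}", chart_of(sample))
```

A character such as the euro sign (U+20AC) lies outside all eight charts and would be reported by `rejected_characters`.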


Requirements:

1. Data input in all three alphabets with all vowels and diacritics → How to implement a comfortable interface?

2. Simultaneous display of texts in three alphabets and two directions → How to implement concurrent writing directions?

3. Search for terms, insensitive to diacritics or vowels → How to implement queries with different collation sets?

I.3. REQUIREMENTS


a Data Input

b Writing Directions

c Search

d Search Terms

I.4. EXAMPLES


ʾ ˒ ʿ ˓

I.4.a. DATA INPUT


[ʾ] U+02BE MODIFIER LETTER RIGHT HALF RING

transliteration of Arabic hamza

[˒] U+02D2 MODIFIER LETTER CENTRED RIGHT HALF RING

more rounded articulation

[ʿ] U+02BF MODIFIER LETTER LEFT HALF RING

transliteration of Arabic ain

[˓] U+02D3 MODIFIER LETTER CENTRED LEFT HALF RING

less rounded articulation

I.4.a. DATA INPUT


Problem: Appearance vs. Encoding

Users will normally choose characters …

→ not because of their Unicode description

→ but because of their appearance

How to bring Unicode to the user?

I.4.a. DATA INPUT
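One way to make the encoding visible next to the appearance is to show each character together with its code point and official Unicode name. The following is a small sketch in Python using the standard `unicodedata` module; the helper name `describe` is an assumption for illustration, not part of the project.

```python
import unicodedata


def describe(char):
    """Return the code point and Unicode name of char (hypothetical helper)."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    # The four look-alike half rings from the data input example.
    for char in "\u02be\u02d2\u02bf\u02d3":
        print(char, describe(char))
```

A virtual keyboard or input check could display this description as a tooltip, so that users picking a character by its shape still see which code point they are entering.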


Solutions:

– restrict the characters accepted by the database
  → safe, but requires validation methods

– provide a virtual keyboard (on-screen)
  → user-friendly

Alternative methods:

– Beta Code
  → less recommendable from a Unicode point of view
  → but widely used

I.4.a. DATA INPUT


Phenomenon:

Home (THEN) Arabic Glossary (THEN) ص (THEN) صحة

becomes

Home > Arabic Glossary > ص> صحة

I.4.b. WRITING DIRECTIONS


Problem: Strong vs. Weak Characters

In Unicode, alphabetic characters are usually

STRONG CHARACTERS

which determine the writing direction,

while punctuation characters are usually

WEAK CHARACTERS

which do not change the writing direction.

→ relevant in:

comma separated lists, bibliographic references,

breadcrumb lines, table alignments …

I.4.b. WRITING DIRECTIONS
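The directional class of any character can be looked up with Python's standard `unicodedata.bidirectional`. The sketch below treats the classes L, R and AL as strong, as the Unicode Bidirectional Algorithm does; the helper name `is_strong` is an assumption. Unicode itself further divides the non-strong characters into weak classes (for example CS for the comma) and neutral classes (for example ON for ">"), which the slide groups together.

```python
import unicodedata

# Bidirectional classes that determine the writing direction.
STRONG_CLASSES = {"L", "R", "AL"}


def is_strong(char):
    """True if char has a strong directional class (hypothetical helper)."""
    return unicodedata.bidirectional(char) in STRONG_CLASSES


if __name__ == "__main__":
    # Latin letter, Arabic letter sad, breadcrumb separator, comma, space.
    for char in ["H", "\u0635", ">", ",", " "]:
        print(repr(char), unicodedata.bidirectional(char), is_strong(char))
```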


Solutions:

– insert a “strong whitespace”: Unicode U+200E (left-to-right mark) or U+200F (right-to-left mark)

– or, if in HTML, set the writing direction directly: <span dir="ltr">…</span>

I.4.b. WRITING DIRECTIONS
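The first solution can be sketched for the breadcrumb case as follows, assuming Python. The function name `breadcrumb` is hypothetical, and placing a mark on both sides of the separator is one possible choice, not necessarily what the project does. With a strong left-to-right character on each side, the separator can no longer be absorbed into a right-to-left run formed by two adjacent Arabic items.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, invisible but strongly left-to-right


def breadcrumb(parts, separator=" > "):
    """Join parts so that every separator is enclosed by LRM marks."""
    return (LRM + separator + LRM).join(parts)


if __name__ == "__main__":
    crumbs = breadcrumb(["Home", "Arabic Glossary", "\u0635", "\u0635\u062d\u0629"])
    # Show the code points, since the marks are invisible when printed.
    print([f"U+{ord(c):04X}" for c in crumbs if c == LRM or ord(c) > 0x7F])
```

In HTML output, wrapping each item or the whole line in an element with an explicit `dir` attribute achieves the same effect without extra characters in the data.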


GREEK: diacritics, not distinct; requirement: η finds also ἠ ἦ ἥ

ARABIC: vowel signs, not distinct; requirement: سبب finds also سَبَبٍ

ENGLISH: diacritics, distinct; requirement: d does not find ḏ

Problem: Distinction vs. Collation

I.4.c. SEARCH


Solution:

Greek → Greek collation

Arabic → Arabic collation

English → Latin collation

Collation Charts: <http://unicode.org/charts/uca/>

Restrictions:

– does not work for mixed texts
  → data needs to be separated

– some environments do not support Arabic vowel collation
  → e.g. MySQL <6.0

I.4.c. SEARCH
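Where the database cannot provide a suitable collation, a comparable effect can be approximated in application code. The following sketch is a swapped-in technique, not the collation-based solution named on the slide: it builds a search key by Unicode decomposition and removal of combining marks, and does so only for Greek and Arabic fields. The function name `search_key` and the language labels are assumptions.

```python
import unicodedata

# Languages whose diacritics or vowel signs are ignored in search.
FOLDED_LANGUAGES = {"greek", "arabic"}


def search_key(text, language):
    """Return a comparison key for text in the given language field."""
    if language not in FOLDED_LANGUAGES:
        # Latin transliteration: diacritics stay distinct.
        return unicodedata.normalize("NFC", text)
    decomposed = unicodedata.normalize("NFD", text)
    # Category Mn covers Greek accents and breathings and Arabic vowel signs.
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(search_key("\u1f20", "greek") == search_key("\u03b7", "greek"))
    print(search_key("d", "english") == search_key("\u1e0f", "english"))
```

Because each field is folded according to its own language, this approach also presupposes that Greek, Arabic and Latin data are stored separately. One caveat: the folding also removes the hamza and madda marks from precomposed Arabic letters such as alef with hamza above, which may or may not be wanted.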


Phenomenon:

– user searches for Arabic words starting with مل

– truncation symbol (asterisk) appears on the wrong side

*مل

Problem: Neutral Writing Direction

– the standard asterisk is a NEUTRAL CHARACTER

– it adopts the main writing direction

I.4.d. SEARCH TERMS


Solution:

Unicode Arabic asterisk (U+066D ARABIC FIVE POINTED STAR), right-to-left

مل٭

I.4.d. SEARCH TERMS


Challenges for the Developer:

– Unicode does not provide general truncation or wildcard symbols

– different asterisk and wildcard signs must be processed

– no standard solution available

I.4.d. SEARCH TERMS
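Processing several truncation signs can be sketched as a normalization step before the query is built. The example below assumes Python and assumes that the backend uses SQL `LIKE` patterns, which the slides do not state; the function name `to_like_pattern` is hypothetical. It maps both the ASCII asterisk and the Arabic asterisk to the `%` wildcard and escapes characters that `LIKE` would otherwise interpret.

```python
# Signs accepted as truncation symbols: ASCII asterisk and U+066D.
TRUNCATION_SIGNS = {"*", "\u066d"}


def to_like_pattern(term, escape="\\"):
    """Convert a user search term into a SQL LIKE pattern (sketch)."""
    pattern = []
    for char in term:
        if char in TRUNCATION_SIGNS:
            pattern.append("%")
        elif char in ("%", "_", escape):
            # Literal characters that LIKE would treat as special.
            pattern.append(escape + char)
        else:
            pattern.append(char)
    return "".join(pattern)


if __name__ == "__main__":
    # Same term typed with either asterisk yields the same pattern.
    print(to_like_pattern("\u0645\u0644\u066d") == to_like_pattern("\u0645\u0644*"))
```

The resulting pattern should still be passed to the database as a bound parameter, together with a matching `ESCAPE` clause.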


Technical Recommendations for Polyalphabetic Environments

– use software components that support Unicode throughout

– compose a project corpus of Unicode characters

– provide input methods to make the characters easily available

– consider Unicode writing directions and collations

– make sure that all characters not only appear correctly, but are also encoded correctly

SUMMARY OF I.


1 Corpus → How to deal with a database of 70,000+ words?

2 Translation movements → How to visualize transformations of language structures?

3 Single Lexemes → How to transform the database into a dictionary?

II. SCHOLARLY REQUIREMENTS


How to deal with a database of 70,000+ words?

– search form
  → user needs to know exactly what he/she is looking for

– browsing (e.g. by sources and words in alphabetical order)
  → user needs to know roughly what he/she is looking for

– visualization
  → statistical and/or graphical approach
  → user can explore the corpus

II.1. CORPUS


II.1.a. CORPUS TREEMAP

Distribution of sources in the GlossGA corpus

Area size corresponds to number of words

→ Which sources constitute the major/minor parts of the corpus?
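The principle that area corresponds to word count can be illustrated with a very simple one-level layout. This is a sketch in Python under stated assumptions: the function name `slice_layout` is hypothetical, the word counts are invented placeholders rather than GlossGA figures, and real treemaps usually use a more balanced algorithm than plain vertical strips.

```python
def slice_layout(word_counts, width=1.0, height=1.0):
    """Split a width x height rectangle into vertical strips.

    Each strip's area is proportional to the word count of its source.
    Returns a list of (source, x, y, strip_width, strip_height).
    """
    total = sum(word_counts.values())
    rectangles = []
    x = 0.0
    # Largest source first, so major parts of the corpus stand out.
    for source, count in sorted(word_counts.items(), key=lambda item: -item[1]):
        strip_width = width * count / total
        rectangles.append((source, x, 0.0, strip_width, height))
        x += strip_width
    return rectangles


if __name__ == "__main__":
    # Invented placeholder counts, for illustration only.
    counts = {"Source A": 600, "Source B": 300, "Source C": 100}
    for rectangle in slice_layout(counts):
        print(rectangle)
```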


II.1.b. SOURCE TREEMAP

Distribution of words in one source

Area size corresponds to number of words

→ What kind of vocabulary constitutes the source?


How to visualize transformation of language structures?

→ compare parts of speech in diagrams (experimental)

II.2. TRANSLATION MOVEMENTS


Compared Parts of Speech

Blue:

Greek Parts of Speech

Red:

Arabic Parts of Speech

Bar Length:

number of words of the respective part of speech

II.2.a. TRANSLATION MOVEMENTS


Compared Parts of Speech

X-Axis:

Greek Parts of Speech

Y-Axis:

Arabic Parts of Speech

Intersections:

Dot size represents number of words transferred from Greek PoS into Arabic PoS

II.2.b. TRANSLATION MOVEMENTS
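The numbers behind such a dot matrix amount to a cross-tabulation of part-of-speech pairs. A minimal sketch in Python, assuming each glossary entry can be reduced to a (Greek PoS, Arabic PoS) pair; the function name `pos_matrix` and the sample pairs are invented for illustration.

```python
from collections import Counter


def pos_matrix(pairs):
    """Count how often each (Greek PoS, Arabic PoS) combination occurs."""
    return Counter(pairs)


if __name__ == "__main__":
    # Invented sample pairs, not taken from the GlossGA data.
    sample = [
        ("noun", "noun"),
        ("noun", "noun"),
        ("verb", "noun"),
        ("adjective", "noun"),
        ("verb", "verb"),
    ]
    matrix = pos_matrix(sample)
    for (greek_pos, arabic_pos), count in sorted(matrix.items()):
        print(f"{greek_pos} -> {arabic_pos}: {count}")
```

Each count would then be mapped to a dot size at the intersection of the Greek column and the Arabic row; combinations that never occur simply have a count of zero.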


How to transform the database into a dictionary?

Experimental preview:

→ collation of all entries of a Greek lexeme

→ ordered by Arabic lexeme

→ output with source and context

II.3.a. SINGLE LEXEMES
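The three steps of this preview can be sketched as a grouping operation over database records. The example assumes Python and assumes records with the hypothetical fields `greek`, `arabic`, `source` and `context`; the sample records are invented placeholders, and the actual GlossGA schema is not shown in the slides.

```python
from collections import defaultdict


def dictionary_entry(records, greek_lexeme):
    """Collect all records of one Greek lexeme, grouped by Arabic lexeme.

    Returns a list of (arabic_lexeme, [(source, context), ...]).
    Sorting here is by code point; a real dictionary would use an
    Arabic collation instead.
    """
    grouped = defaultdict(list)
    for record in records:
        if record["greek"] == greek_lexeme:
            grouped[record["arabic"]].append((record["source"], record["context"]))
    return [(arabic, sorted(grouped[arabic])) for arabic in sorted(grouped)]


# Invented placeholder records, for illustration only.
LEXEME = "λόγος"
SAMPLE = [
    {"greek": LEXEME, "arabic": "قول", "source": "Source B", "context": "context 2"},
    {"greek": LEXEME, "arabic": "كلمة", "source": "Source A", "context": "context 3"},
    {"greek": LEXEME, "arabic": "قول", "source": "Source A", "context": "context 1"},
    {"greek": "other", "arabic": "قول", "source": "Source A", "context": "context 4"},
]

if __name__ == "__main__":
    for arabic, attestations in dictionary_entry(SAMPLE, LEXEME):
        print(arabic, attestations)
```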


Export function via email:

II.3.b. SINGLE LEXEMES


Recommendations

1 provide multiple access methods
  → support various user scenarios

2 invent statistical and visual evaluation methods
  → profit from electronic data processing

3 provide conventional scholarly formats
  → correspond to the community’s needs

SUMMARY OF II.


Situation: Technical vs. Scholarly Requirements

– which one goes first?

→ technical requirements as necessary basis

→ scholarly requirements as superior objective

– both need attention from scholars

– both need attention from techies

→ mutual understanding

→ team competence

LAST BUT ONE SLIDE


Thank you for your attention!

Project Website: http://telota.bbaw.de/glossga

Contact: Yury Arzhanov | [email protected]; Torsten Roeder | [email protected]