Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Sawood Alam
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529

Fateh ud din B Mehmood
National University of Sciences and Technology, Islamabad, Pakistan

Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529


Post on 06-Aug-2015



The Time Travel

OK Google, Define Dictionary
A book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.

Dictionaries Are Different
Read: random access
Write: maintain sort order
The most compact mode to preserve a language

Problem: English Dictionary

Johnson's English dictionary

Problem: Urdu Dictionary

Farhang-e-Asifiyah

Related Work

Unicode Collation
Ordered assembly of written information
Unicode values != natural collation
Arabic script: U+0600 to U+06FF
Out of order alphabets in derived languages
Common Locale Data Repository (CLDR)
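The point that Unicode values do not match natural collation can be illustrated with a small sketch. The alphabet order below is one commonly taught Urdu ordering and is an assumption of this example, as are the names `URDU_ALPHABET`, `RANK`, and `urdu_key`; a real system would more likely rely on CLDR tailoring through a collation library.

```python
# Urdu alphabet in a commonly taught order (an assumption for this sketch;
# the position of some letters, such as hamza, varies between sources).
URDU_ALPHABET = (
    "\u0627\u0628\u067e\u062a\u0679\u062b\u062c\u0686\u062d\u062e"
    "\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0633\u0634\u0635"
    "\u0636\u0637\u0638\u0639\u063a\u0641\u0642\u06a9\u06af\u0644"
    "\u0645\u0646\u0648\u06c1\u06be\u0621\u06cc\u06d2"
)
RANK = {ch: i for i, ch in enumerate(URDU_ALPHABET)}


def urdu_key(word):
    # Characters not in the table sort after all listed letters, by code point.
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]


letters = ["\u062a", "\u067e", "\u0628"]       # teh, peh, beh
by_code_point = sorted(letters)                 # beh, teh, peh
by_alphabet = sorted(letters, key=urdu_key)     # beh, peh, teh
```

Peh (U+067E) was added to the Arabic block for derived languages, so its code point falls after teh (U+062A) even though it precedes teh in the Urdu alphabet.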

Collation Discrepancies
Compound letters
Diacritical marks
Half letters
Prefixes
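As one illustration of the diacritical-marks discrepancy, a lookup key can be built by dropping nonspacing marks before comparison. This is only a sketch: the helper name `strip_marks` is hypothetical, and whether marks should be ignored outright or used as a tie-breaker depends on the dictionary.

```python
import unicodedata


def strip_marks(word):
    # Drop nonspacing marks (Unicode category Mn), such as Arabic-script
    # vowel signs, without normalizing precomposed letters.
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


marked = "\u0639\u0650\u0644\u0645"   # ain + kasra + lam + meem
plain = strip_marks(marked)           # ain + lam + meem
```

Because the text is not decomposed first, a precomposed letter such as alef with madda above (U+0622) is left untouched.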

Nested Ordering
Root word sorting (Arabic)
    Morphological derivation
    Derived word simplification
Radicals and strokes (Chinese)
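Root word sorting can be sketched as a two-level sort key: first the root, then the derived word. The root table below is a hand-written, hypothetical stand-in for real morphological analysis, and the Latin transliterations are used only for readability.

```python
# Hypothetical root table (transliterated); a real dictionary would need
# morphological analysis to map each derived word to its root.
ROOTS = {
    "kitab": "ktb", "katib": "ktb", "maktab": "ktb",
    "dars": "drs", "madrasa": "drs", "mudarris": "drs",
}


def nested_key(word):
    # Group by root first, then order derived words within the group.
    return (ROOTS.get(word, word), word)


words = ["maktab", "dars", "kitab", "madrasa"]
flat = sorted(words)                    # plain alphabetical order
nested = sorted(words, key=nested_key)  # grouped under roots drs, ktb
```

In the flat order "kitab" falls between "dars" and "madrasa"; in the nested order both words derived from d-r-s stay together, which is why a reader must know the root before looking a word up.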

Indexing: Ordered Pages

Indexing: Sparse Index
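A sparse index can be read as a sorted list of headwords recorded for only some pages, with lookup done by binary search. The sketch below assumes each entry stores the first headword of a page; the index contents and the name `candidate_pages` are hypothetical, and English is used so that plain string order is adequate.

```python
from bisect import bisect_right

# Hypothetical sparse index: (first headword on the page, page number),
# sorted by headword.
SPARSE_INDEX = [("aardvark", 1), ("cat", 40), ("gnu", 95), ("pear", 150)]
HEADWORDS = [w for w, _ in SPARSE_INDEX]


def candidate_pages(word):
    """Return (start, end): the word can only be on pages start..end-1.

    end is None when the range runs to the last page of the dictionary.
    """
    i = bisect_right(HEADWORDS, word) - 1
    if i < 0:
        i = 0  # word sorts before the first indexed headword
    start = SPARSE_INDEX[i][1]
    end = SPARSE_INDEX[i + 1][1] if i + 1 < len(SPARSE_INDEX) else None
    return start, end
```

For example, `candidate_pages("dog")` narrows the search to pages 40 through 94, after which the reader browses the page images. A denser index shortens that range at the cost of more indexing effort.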

Indexing: Full Index

Indexing: Location Index

Indexing State Transition

Annotation

Digitization

Dictionary Explorer
Multilingual Multi-dictionary Lookup
Searching and Exploring
Annotation and digitization
User Contribution and Feedback
Open Source => GitHub:/urduweb/DictionaryExplorer

Dictionary Explorer: English

Dictionary Explorer: English

Dictionary Explorer: Urdu

Dictionary Explorer: Urdu

Indexing Time

Dictionary                 Pages   Index   Mode               Time
English to Urdu            180     Sparse  Manual and Script  10 minutes
Monolingual Urdu           2,500   Sparse  Manual             2 hours
Monolingual Classic Urdu   3,200   Full*   Crowdsource**      60 days

* 75,000 words, phrases, proverbs, and idioms
** 13 contributors

Prefix Permutations

Prefix: One

Prefix: Two

Prefix: Three

Prefix: Four

Prefix: Five

Prefix: Six
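The prefix slides survive here only as titles, so the following is a sketch under the assumption that they relate to how the set of candidate matches changes with prefix lengths one through six. The word list and the name `matches_by_prefix_length` are hypothetical.

```python
def matches_by_prefix_length(words, query, max_len=6):
    """Count, for each prefix length 1..max_len, the words sharing that prefix of query."""
    counts = {}
    for n in range(1, min(max_len, len(query)) + 1):
        prefix = query[:n]
        counts[n] = sum(1 for w in words if w.startswith(prefix))
    return counts


words = ["dictate", "diction", "dictionary", "dictum", "digit", "dog"]
result = matches_by_prefix_length(words, "dictionary")
```

On this toy list the count falls from six matches for the one-letter prefix to two for the six-letter prefix, the kind of trade-off that matters when a short prefix returns too many matches.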

Conclusions and Future Work
Identified issues
    Too many matches
    Lack of fielded searching
    Lack of OCR support
    No input method assistance
Collation challenges
Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
Implemented a multi-lingual multi-dictionary explorer
Effort and prefix evaluation
In future: elastic index and automatic region estimation
GitHub:/urduweb/DictionaryExplorer