e lex presentation_03
![Page 1: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/1.jpg)
Lexical Profiling for Arabic
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi, Josef van Genabith
National Centre for Language Technology (NCLT),
School of Computing, Dublin City University
Funded by:
Enterprise Ireland, the Irish Research Council for Science
Engineering and Technology (IRCSET), and
the EU projects PANACEA and META-NET
![Page 2: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/2.jpg)
Overview
• Introduction
• Building the lexical database for Arabic
– Corpus-based Selection of Entries
– Morphological Details: Inflectional Paradigms
– Syntactic Details: Subcategorization Frames
• Web Application
• Conclusion
![Page 3: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/3.jpg)
Introduction
• Modern Standard Arabic vs. Classical Arabic
• Current State of Arabic Lexicography
– Lexicons are not corpus-based
– Buckwalter Electronic Dictionary and Arabic Morphological Analyser
– No lexica for subcategorization frames
• Importance of Lexical Resources
![Page 4: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/4.jpg)
Introduction
• Arabic Morphotactics
![Page 5: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/5.jpg)
Aim
• Constructing a lexical database of Modern Standard Arabic
• Constructing a database for Arabic subcategorization frames
![Page 6: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/6.jpg)
Methodology
Lexical Details
• Using a medium-scale manually created lexicon of 10,799 lemmas
• Using statistics from a 1 billion word corpus (annotated by MADA)
– 90% from the LDC's Arabic Gigaword
– 10% collected from the Al-Jazeera website
Subcategorization Details
• Using a medium-scale manually created lexicon of 2,901 lemma-frame types
• Using the Penn Arabic Treebank (22,524 sentences and 587,665 words)
![Page 7: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/7.jpg)
Extending the Lexical Database
• Start off with a seed lexicon
– Three Lexical Databases, manually constructed
• 5,925 nominal lemmas, with details on:
– Gender and number
– Inflection paradigm (13 continuation classes)
– Humanness
• 1,529 verb lemmas, with details on:
– Transitivity
– Whether passive is allowed or not
– Whether the imperative is allowed or not
• 490 patterns (456 for nominals and 34 for verbs)
• Lemma-root lookup database
![Page 8: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/8.jpg)
Methodology
![Page 9: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/9.jpg)
Extending the Lexical Database
• Automatically Extending the Lexical Database: Lexical Enrichment
– Data-driven filtering technique
• 40,648 lemmas (in Buckwalter or SAMA 3.1)
• Statistics from three web search engines
• Statistics from the corpus annotated by MADA
• 29,627 lemmas (left after filtering)
![Page 10: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/10.jpg)
Extending the Lexical Database
Automatically Extending the Lexical Database: Feature Enrichment
– Machine Learning
– Multilayer Perceptron classification algorithm
– Training Data: 4,816 nominals and 1,448 verbs
– Classes for nominals: continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective)
– Classes for verbs: transitivity, allowing the passive voice, and allowing the imperative mood
– We feed these datasets with frequency statistics from the corpus and build a vector grid.
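The "vector grid" step can be pictured as turning per-lemma corpus counts into fixed-length numeric vectors that a classifier such as a Multilayer Perceptron can consume. The sketch below is only an illustration under stated assumptions: the slides do not list the actual features, so the feature names in `FEATURES`, the toy lemmas, and the counts are all hypothetical.

```python
from collections import Counter

# Hypothetical feature inventory: morphological tags that a tagger such as
# MADA might attach to corpus tokens of a lemma. The slides do not name the
# real features, so these labels are illustrative only.
FEATURES = ["sg", "dual", "pl_masc_suffix", "pl_fem_suffix", "definite"]


def feature_vector(tag_counts, features=FEATURES):
    """Convert raw tag counts for one lemma into relative frequencies."""
    total = sum(tag_counts.get(f, 0) for f in features)
    if total == 0:
        # Lemma unseen in the corpus: return an all-zero row.
        return [0.0] * len(features)
    return [tag_counts.get(f, 0) / total for f in features]


def build_grid(corpus_counts):
    """Map every lemma to its feature vector (one row of the grid)."""
    return {lemma: feature_vector(c) for lemma, c in corpus_counts.items()}


# Toy counts (invented for the example, not taken from the corpus).
toy_counts = {
    "lemma_a": Counter({"sg": 60, "pl_fem_suffix": 30, "definite": 10}),
    "lemma_b": Counter(),
}
grid = build_grid(toy_counts)
```

Each row of `grid` would then be paired with a known class label from the seed lexicon for training, and rows for new lemmas would be passed to the trained classifier for prediction.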
![Page 11: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/11.jpg)
Extending the Lexical Database
• Extending the Lexical Database
– Feature enrichment using Machine Learning
![Page 12: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/12.jpg)
Extending the Lexical Database
• Extending the Lexical Database
– With Machine Learning we add about 18,000 new lemmas: 12,974 nominals and 5,034 verbs
![Page 13: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/13.jpg)
Extending the Lexical Database
• Handling Broken Plurals
jAnib (side)
jawAnib (sides)
Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc> <pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>
(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc> <pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>
Two differences: voc and gloss
![Page 14: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/14.jpg)
Extending the Lexical Database
• Extracting Broken Plurals
<gloss>side/aspect</gloss>
<gloss>sides/aspects</gloss>
We use Levenshtein distance, which measures the difference between two strings (here, glosses that share the same lemmaID).
distance of 2 / length of the first string = 0.15 (within the threshold 0.4)
We collect 2,266 candidates
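The gloss comparison can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and treating "within the threshold" as less than or equal to 0.4 is an assumption. For the example pair the distance is 2. Dividing by the 11-character singular gloss gives about 0.18, while dividing by the 13-character plural gloss gives about 0.15, the figure on the slide. Both values are below the 0.4 threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


THRESHOLD = 0.4  # value stated on the slide


def is_plural_candidate(gloss_sg: str, gloss_pl: str,
                        threshold: float = THRESHOLD) -> bool:
    """Flag two glosses (same lemmaID) as a possible singular/plural pair.

    Normalises by the length of the first string, as the slide describes.
    """
    if not gloss_sg:
        return False
    return levenshtein(gloss_sg, gloss_pl) / len(gloss_sg) <= threshold


distance = levenshtein("side/aspect", "sides/aspects")
ratio_first = distance / len("side/aspect")
ratio_second = distance / len("sides/aspects")
```

A pair that passes this check is only a candidate; it still has to pass the pattern-based validation step.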
![Page 15: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/15.jpg)
Extending the Lexical Database
• Validating Broken Plurals
<voc>jAnib</voc> singular
pattern is: fAEil
regex is: .A.i.
<voc>jawAnib</voc> plural
pattern is: fawAEil
regex is: .awA.i.
Pattern database: 135 singular patterns that choose from a set of 82 broken plural patterns
2,266 candidates -> 1,965 are validated (87%)
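The pattern check above can be sketched in a few lines. This is an illustration under assumptions: `PATTERN_DB` holds only the single singular-to-plural mapping shown on the slide (the real database has 135 singular and 82 plural patterns), and the conversion treats every `f`, `E`, or `l` in a pattern as a root-consonant slot, which would be too naive for patterns containing those letters as literal affixes.

```python
import re

# Root-consonant placeholders in the pattern notation (f-E-l template).
ROOT_SLOTS = set("fEl")


def pattern_to_regex(pattern: str) -> str:
    """Replace root slots with '.', keeping other characters literal."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch)
                   for ch in pattern)


def matches(voc: str, pattern: str) -> bool:
    """True if the vocalised form fits the morphological pattern."""
    return re.fullmatch(pattern_to_regex(pattern), voc) is not None


# Hypothetical one-entry excerpt of the pattern database:
# each singular pattern licenses a set of broken plural patterns.
PATTERN_DB = {"fAEil": {"fawAEil"}}


def validate(singular: str, plural: str, db=PATTERN_DB) -> bool:
    """Accept a candidate pair if the plural fits a pattern licensed
    by a pattern that the singular fits."""
    for sg_pattern, pl_patterns in db.items():
        if matches(singular, sg_pattern):
            if any(matches(plural, p) for p in pl_patterns):
                return True
    return False
```

With this excerpt, `validate("jAnib", "jawAnib")` accepts the pair from the slide, while a candidate whose plural does not fit a licensed pattern is rejected.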
![Page 16: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/16.jpg)
Extending the Lexical Database
• Interesting statistics on Arabic plurals
Insights from the corpus:
5,570 lemmas have a feminine plural suffix
1,942 lemmas have a masculine plural suffix
2,730 lemmas have broken plural forms
![Page 17: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/17.jpg)
Extraction of Subcat Frames
• Importance of subcategorization frames
• Advantage of Automatic Extraction
• Available Resource on Arabic Subcat Frames:
– none except Arabic LFG Parser (Attia, 2008) - available as open source
![Page 18: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/18.jpg)
Extraction of Subcat Frames
What are LFG subcat frames?
• Governable GFs (SUBJ, OBJ, OBJθ, OBLθ, COMP and XCOMP)
• Non-governable GFs (ADJ and XADJ)
π<gf1, gf2, …, gfn>
{iEotamada Al-Tifolu EalaY wAlidati-hi “The child relied on his mother”
{iEotamada<(↑SUBJ)(↑OBL_EalaY)>
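One way to hold such a frame in code is a predicate plus an ordered tuple of grammatical functions. The sketch below is a hypothetical representation, not the format used by the authors: the class names, the `marker` field for the preposition of an oblique, and the underscore rendering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Base names of the governable grammatical functions listed on the slide
# (OBJθ and OBLθ are represented as OBJ/OBL plus an optional marker).
GOVERNABLE = {"SUBJ", "OBJ", "OBL", "COMP", "XCOMP"}


@dataclass(frozen=True)
class Arg:
    gf: str                       # grammatical function, e.g. "SUBJ"
    marker: Optional[str] = None  # e.g. the preposition heading an OBL

    def is_governable(self) -> bool:
        return self.gf in GOVERNABLE

    def render(self) -> str:
        name = f"{self.gf}_{self.marker}" if self.marker else self.gf
        return f"(↑{name})"


@dataclass(frozen=True)
class Frame:
    pred: str               # lemma in Buckwalter transliteration
    args: Tuple[Arg, ...]   # subcategorised functions, in order

    def render(self) -> str:
        return f"{self.pred}<{''.join(a.render() for a in self.args)}>"


# The example from the slide: "rely" takes a subject and an oblique
# headed by the preposition EalaY.
example = Frame("{iEotamada", (Arg("SUBJ"), Arg("OBL", "EalaY")))
```

Because both classes are frozen, frames are hashable and can be counted directly as lemma-frame types.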
![Page 19: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/19.jpg)
Extraction of Subcat Frames
Automatic extraction of subcat frames
• The ATB contains 22,524 sentences
• LFG Annotation algorithm (DCU)
• Traversing trees and looking for dependencies
• Lemmatization
• We extract 7,746 lemma-frame types (for verbs, nouns and adjectives)
![Page 20: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/20.jpg)
Extraction of Subcat Frames
Estimating the Subcategorization Probability
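The formula on this slide is not preserved in the transcript. A common estimate, assumed here rather than confirmed by the source, is the relative frequency of a frame given its lemma: the number of times the lemma occurs with the frame, divided by the number of times the lemma occurs at all. The sketch below uses invented observations and placeholder lemma and frame names.

```python
from collections import Counter, defaultdict


def frame_probabilities(observations):
    """Estimate P(frame | lemma) by relative frequency.

    `observations` is a list of (lemma, frame) pairs, one per occurrence
    extracted from the treebank.
    """
    pair_counts = Counter(observations)
    lemma_counts = Counter(lemma for lemma, _ in observations)
    probs = defaultdict(dict)
    for (lemma, frame), n in pair_counts.items():
        probs[lemma][frame] = n / lemma_counts[lemma]
    return dict(probs)


# Toy data (invented): one lemma seen four times with two frames.
toy_observations = [
    ("lemma_x", "<SUBJ,OBL>"),
    ("lemma_x", "<SUBJ,OBL>"),
    ("lemma_x", "<SUBJ,OBL>"),
    ("lemma_x", "<SUBJ>"),
]
toy_probs = frame_probabilities(toy_observations)
```

The probabilities for each lemma sum to 1, so a low-probability frame can be filtered out with a simple cut-off if desired.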
![Page 21: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/21.jpg)
Extraction of Subcat Frames
Evaluating the Subcategorization Extraction
![Page 22: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/22.jpg)
Extraction of Subcat Frames
Evaluating the Subcategorization Extraction
![Page 23: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/23.jpg)
Web Application
• AraComLex Lexicon Writing Application
www.cngl.ie/aracomlex
![Page 24: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/24.jpg)
Byproducts of the Work
A number of open-source resources:
• Finite-state morphological transducer
• Arabic morphological patterns
• Subcategorization frames
• Arabic lemma frequency counts
![Page 25: E lex presentation_03](https://reader033.vdocuments.net/reader033/viewer/2022052523/555eb0fdd8b42a902e8b5584/html5/thumbnails/25.jpg)
Conclusion
• We successfully use machine learning to predict morpho-syntactic features for newly acquired words.
• We successfully extract subcategorization frames from the Penn Arabic Treebank.
• We provide the specification and implementation of an Arabic lexicographic web application.