NLP resources: construction, standardization, exploitation
NLP resources:
construction, standardization, exploitation & API
Karim Bouzoubaa
Outline
• Exploitation
• NLP resources
• Construction
• Standardization
• API
Exploitation
Exploitation
Language resources (LRs) are used in various NLP software tools:
• morphological, syntactic and semantic analysis
• automatic translation
• automatic generation of texts
• spell-checking
• automatic summarization
• handwriting recognition
• reformulation and paraphrasing
• information search and text mining
NLP Resources
Resources
• Introduction – Definition
• Types
• Examples
• Evaluation criteria
Introduction - Definition
• The key to NLT development is the language resource
• Resource production takes a lot of effort and is very expensive
Example: the Arabic standard LC-STAR phonetic lexicon distributed by the European Language Resources Association (ELRA), with 110,271 entries, costs 21,250.00 EUR (for use in academic research)
Language resources are language-related data, accessible in an electronic format, and used for the development of NLP systems
Types – 2 categories
1. Corpus
• written: monolingual texts, multilingual texts, annotated texts, treebanks
• speech: reading texts aloud, speeches, dialogues, radio and television broadcasts
• multimedia: images, sounds and videos
2. Lexicon
• monolingual and multilingual dictionaries
• gazetteers (geographical dictionaries)
• terminologies
• ontologies
Content of a lexicon
An entry in the lexicon may contain:
• morphological, syntactic, semantic and pragmatic information
• the grammatical category (noun, verb, etc.)
o subcategory properties (transitive verb or not, masculine or feminine)
• semantic information (animate noun, verb requiring a human subject)
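A lexicon entry of the kind described on this slide can be sketched as a small record type. This is only an illustration: the class name `LexiconEntry`, its field names, and the three sample entries are assumptions made for the example and do not come from any particular lexicon.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexiconEntry:
    """One lexicon entry; every field name here is illustrative."""
    lemma: str
    category: str  # grammatical category: "noun", "verb", ...
    # Subcategory properties, e.g. {"transitive": True} or {"gender": "feminine"}
    subcategory: dict = field(default_factory=dict)
    # Semantic information, e.g. {"animate": True} or {"subject": "human"}
    semantics: dict = field(default_factory=dict)
    # Morphological information, e.g. {"root": "..."}
    morphology: dict = field(default_factory=dict)
    # Pragmatic information, e.g. a register or usage note
    pragmatics: Optional[str] = None


def find_by_category(entries, category):
    """Return the lemmas of all entries with the given grammatical category."""
    return [e.lemma for e in entries if e.category == category]


# Hypothetical sample data, in English for readability.
lexicon = [
    LexiconEntry("eat", "verb", subcategory={"transitive": True},
                 semantics={"subject": "animate"}),
    LexiconEntry("teacher", "noun", semantics={"animate": True, "human": True}),
    LexiconEntry("think", "verb", subcategory={"transitive": False},
                 semantics={"subject": "human"}),
]

print(find_by_category(lexicon, "verb"))
```

Keeping the subcategory and semantic properties in open dictionaries is one possible design choice; a fixed tag set would make entries easier to validate but harder to extend.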
Examples
Oxford dictionary
VerbNet
Evaluation criteria
• Formal (regardless of content)
– Size
– Maintenance (durability, scalability)
– Compatibility
• Functional (language criteria)
– Lexicographic annotation (existence and relevance)
– Intrinsic rules
Construction
Construction
Production cycle
• Creating resources
• Example (Contemporary Arabic)
• Reusing resources
• Example of free resources
Good practices
• Documentation
• Interoperability
• Viability
Creating resources
Two approaches for developing LRs:
• creating new resources
• tuning existing resources
Creating resources
Collect "authentic" data, of a general nature or belonging to a particular sector of activity, directly in digital form or, in some cases, by digitizing it.
Example of creating resources
Contemporary Arabic
Resource reuse
• Reuse is the operation of adapting an existing resource so that it can perform certain functions, and be improved, in a usage environment different from the original one
• Example: ....
Example of free resources
Corpus
• Corpus of Contemporary Arabic
• Khoja POS tagged corpus
• Quranic Arabic
• Collection of free Arabic texts and books:
- Almeshkat
- Al-Eman
Lexicon
• Buckwalter's list of Arabic roots
• Al-Baheth Al-Arabi
Good practices
In order to contribute to the creation of a set of sustainable LRs, some principles must be respected:
• Resource documentation
• Interoperability of resources
Documentation of resources
LRs are often poorly documented or not documented at all. Documentation should be as comprehensive as possible, and include information on:
• the format of the data
• the content of the data
• the production context
• the possible uses
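The four documentation items listed on this slide can be checked mechanically. The following is a minimal sketch, assuming documentation is kept as a simple key/value record; the field names, the `missing_documentation` helper, and the sample record are all hypothetical.

```python
# The four items the slide asks for, as hypothetical field names.
REQUIRED_FIELDS = ("format", "content", "production_context", "possible_uses")


def missing_documentation(record):
    """Return the required documentation fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical documentation record for an imaginary resource.
record = {
    "name": "Example Arabic news corpus",
    "format": "UTF-8 plain text, one document per file",
    "content": "newspaper articles, general domain",
    "production_context": "",  # left empty on purpose
    "possible_uses": "language modelling, lexicon extraction",
}

print(missing_documentation(record))
```

A check like this could run before a resource is published, so that an undocumented field is caught by the producer rather than discovered by whoever tries to reuse the resource.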
Resources interoperability
• The interoperability of LRs is their ability to operate in different systems
• The formats of the LRs must be standard
Interoperability – documentation – reuse
Many difficulties are encountered when reusing available LRs
Interoperability – documentation – reuse
• Contribute to the development of LRs respecting interoperability rules
– Availability
– Portability
– Reusability
– Normalization
Standardization
Why?
• How to integrate existing resources into one's own contexts?
• How to separate the resources from the tools that manage them?
Panorama
Standardization agencies:
• CNIS: China National Institute of Standardization
• AFNOR: Association Française de Normalisation
• DIN: Deutsches Institut für Normung
• ANSI: American National Standards Institute
• W3C: World Wide Web Consortium
• TEI: Text Encoding Initiative
• ISO: International Organization for Standardization
Projects:
• LIRICS: Linguistic Infrastructure for Interoperable Resources and Systems
• EAGLES: Expert Advisory Group on Language Engineering Standards
• Multext: Multilingual Text Tools and Corpora
Research structures:
• CLARIN: Common Language Resources and Technology Infrastructure
• FLaReNet: Fostering Language Resources Network
• Alpage: Analyse Linguistique Profonde à Grande Échelle
Organization
Standards proposal
• Preliminary stage: Preliminary Work Item (PWI)
• Proposal stage: New Work Item Proposal (NP)
• Preparatory stage: new project of the WG
• Committee stage: Committee Draft (CD)
• Enquiry stage: Draft International Standard (DIS)
• Approval stage: Final Draft International Standard (FDIS)
• Publication stage: International Standard (IS)
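The stages of an ISO project follow a fixed order: preliminary, proposal, preparatory, committee, enquiry, approval, publication. The sketch below models that sequence as plain data. The two-digit codes are the ISO harmonized stage codes, and the "WD" (Working Draft) label for the preparatory stage is added here and does not appear on the slide; the `next_stage` helper is purely illustrative.

```python
# (stage code, stage name, deliverable abbreviation), in ISO order.
# "WD" (Working Draft) is an addition for this sketch, not from the slide.
ISO_STAGES = [
    ("00", "Preliminary", "PWI"),
    ("10", "Proposal", "NP"),
    ("20", "Preparatory", "WD"),
    ("30", "Committee", "CD"),
    ("40", "Enquiry", "DIS"),
    ("50", "Approval", "FDIS"),
    ("60", "Publication", "IS"),
]


def next_stage(abbreviation):
    """Return the deliverable that follows the given one, or None after IS."""
    deliverables = [d for _, _, d in ISO_STAGES]
    position = deliverables.index(abbreviation)  # ValueError if unknown
    if position + 1 < len(deliverables):
        return deliverables[position + 1]
    return None


for code, name, deliverable in ISO_STAGES:
    print(code, name, deliverable)
```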
LMF
• Modeling Arabic inflection paradigms according to the LMF standard – Aïda Khemakhem et al. 2007
• Automatic conversion of editorial dictionaries to LMF – Feten Baccar et al. 2008, Aïda Khemakhem et al. 2009
• Domain ontology generation from LMF dictionaries – Feten Baccar et al. 2010
• Proposed standardized representation of standard Arabic lexicons – Susanne Salmon-Alt et al. 2013
• Detection of anomalies and evaluation of the content of LMF dictionaries – Wafa Wali et al. 2014
• Realization of a system for producing Arabic dictionaries that respect the LMF standard – Mohammed Reqqass et al. 2014
LMF Example
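The slide's original figure is not available in this transcript. As a stand-in, here is a minimal sketch of what an LMF-style lexical entry can look like when built with Python's standard library. The element names (`LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `feat`) follow the XML serialization commonly used with LMF; the sample word, the `root` feature name, and the helper functions are assumptions for illustration, and the required `GlobalInformation` element is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> child to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(lexicon, written_form, part_of_speech, root):
    """Append a LexicalEntry with a Lemma to the given Lexicon element."""
    entry = ET.SubElement(lexicon, "LexicalEntry")
    feat(entry, "partOfSpeech", part_of_speech)
    feat(entry, "root", root)  # illustrative feature name, not mandated by LMF
    lemma = ET.SubElement(entry, "Lemma")
    feat(lemma, "writtenForm", written_form)
    return entry


resource = ET.Element("LexicalResource")
lexicon = ET.SubElement(resource, "Lexicon")
feat(lexicon, "language", "ara")

# Hypothetical sample entry: the noun "kitab" (book) with root k-t-b.
build_entry(lexicon, "كتاب", "noun", "كتب")

print(ET.tostring(resource, encoding="unicode"))
```

Because every property is an attribute/value pair on a `feat` element, the same structure can carry whatever data categories a given lexicon needs without changing the schema.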
TEI
<TEI>
  <teiHeader>
    <name>NAFIS Arabic Stemming Gold Standard</name>
    ...
  </teiHeader>
  <text>
    <phr>
      <val>عليكم بالجد فإنه أساس النجاح</val>
      <w rend="عليكم">
        <choice n="14">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلِي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          ...
        </choice>
      </w>
    </phr>
    ...
  </text>
</TEI>
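A TEI file structured like the sample on this slide can be read with Python's standard XML parser. The sketch below is a minimal illustration under stated assumptions: the embedded `SAMPLE` is a shortened, self-contained variant of the slide's markup (two candidate segmentations, no header, no namespace), and `candidate_analyses` is a hypothetical helper, not part of any TEI or NAFIS tooling.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the same shape as the slide's markup (an assumption:
# real TEI files normally declare a namespace, which this sample omits).
SAMPLE = """
<TEI>
  <text>
    <phr>
      <w rend="عليكم">
        <choice n="2">
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلَي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
          <seg>
            <m type="prefix"></m>
            <form type="base">
              <m type="root">علي</m>
              <m type="stem">عَلِي</m>
            </form>
            <m type="suffix">كم</m>
          </seg>
        </choice>
      </w>
    </phr>
  </text>
</TEI>
"""


def candidate_analyses(tei_root):
    """Map each word form to its list of (prefix, root, stem, suffix) tuples."""
    analyses = {}
    for word in tei_root.iter("w"):
        candidates = []
        for seg in word.iter("seg"):
            # Collect every <m> under this <seg>, keyed by its type attribute.
            parts = {m.get("type"): (m.text or "") for m in seg.iter("m")}
            candidates.append((parts.get("prefix", ""), parts.get("root", ""),
                               parts.get("stem", ""), parts.get("suffix", "")))
        analyses[word.get("rend")] = candidates
    return analyses


result = candidate_analyses(ET.fromstring(SAMPLE))
for form, candidates in result.items():
    print(form, len(candidates))
```

Each `seg` is one candidate segmentation of the word, so a stemmer's output can be scored by checking whether its analysis appears among the candidates listed for that word.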