Language Technologies for Arabic and its Dialects. NYUAD Institute Talk, November 15, 2016. Prof. Nizar Habash, New York University Abu Dhabi, [email protected], NYUAD CAMeL Lab


Language Technologies for Arabic and its Dialects

NYUAD Institute Talk, November 15, 2016

Prof. Nizar Habash, New York University Abu Dhabi

[email protected]

NYUAD CAMeL Lab

تقنيات اللغة العربية ولهجاتها (Language Technologies for Arabic and its Dialects)

محاضرة معهد جامعة نيويورك أبوظبي (NYU Abu Dhabi Institute lecture)

د. نزار حبش، جامعة نيويورك أبوظبي (Dr. Nizar Habash, New York University Abu Dhabi)

[email protected]

١٥-١١-٢٠١٦ (15-11-2016)

NYUAD CAMeL Lab

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions

Language Technologies

Language Technologies

• Also known as
  – Natural Language Processing
  – Computational Linguistics
  – Human Language Technology
• Language Technology is an interdisciplinary field
  – Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest applications of computer science
  – Cryptography
  – Machine Translation

Language Technologies

• Applications
  – Information retrieval
  – Machine translation
  – Automatic speech recognition & speech synthesis
  – Sentiment and emotion analysis
  – Dialogue systems & chatting agents
  – Optical character recognition
  – Automatic summarization, etc.
• Enabling technologies
  – Tokenization
  – Part-of-speech tagging
  – Syntactic parsing
  – Lemmatization
  – Word sense disambiguation, etc.

Paradigms for Language Technologies

• Rule-based Approaches
  – Linguists write rules that are applied by the machines
• Machine Learning Approaches
  – Corpus-based, Statistical Approaches
  – Machines learn the "rules" from training data
• Machine learning approaches are dominant in the field

What do we need to help machines learn?

• Data, data and more data!
• Specifically annotated data

Application | Annotated Data Example
Machine Translation | Parallel corpus in two languages: UN corpus with English, Arabic, Chinese, Spanish, Russian, French
Sentiment Analysis | A corpus of tweets with tags indicating: positive, negative, neutral
Speech Recognition | A corpus of audio files with their corresponding transcripts
Optical Character Recognition | A corpus of scanned book page images and their corresponding transcripts
Part-of-Speech | An English corpus with Part-of-Speech indicated for each word

Challenges for Machine Learning Language Technologies

• Size of training data
  – More is better!
• Domain and genre sensitivity
  – Systems trained on news do not do well on novels
• Quality of annotations
  – Why expect good performance if humans do not agree with each other on the task?
• Developing robust algorithms for machine learning is essential
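The point about annotation quality is usually made concrete with an inter-annotator agreement score. The slides do not name a metric; the sketch below assumes Cohen's kappa for two annotators, and the sentiment labels are made up purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels from two annotators (not data from the talk).
annotator_1 = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "pos", "pos", "neu", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A low kappa on a task is a hint that a model trained on those annotations has a low ceiling, whatever the algorithm.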

Machine Learning vs. Human Learning

Humans: predisposed for acquiring language. Machines: not so!


Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
  – Writing system
  – Word structure
  – Dialects
• State-of-the-art Arabic Technology
• Summary and Future Directions

Arabic Script (الخط العربي)

• An Abjad (consonantal alphabet with diacritics)
• Written right-to-left
• Letters have contextual variants
• Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.

Unicode

• The international encoding standard
• Widely supported input and display
• Supports extended Arabic characters
• Multi-script representation

الحب  0000011000100111 0000011001000100 0000011000101101 0000011000101000
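The bit string on the Unicode slide can be reproduced directly: each 16-bit group is the Unicode code point of one letter of الحب. The short sketch below is only an illustration; it uses escape sequences so the letters stay in logical order regardless of how an editor displays them.

```python
import unicodedata

word = "\u0627\u0644\u062D\u0628"  # الحب, stored in logical (typing) order

for ch in word:
    code_point = ord(ch)
    # Code point in hex, the same value as 16 binary digits, and the Unicode name.
    print(f"U+{code_point:04X}  {code_point:016b}  {unicodedata.name(ch)}")

# Concatenating the four 16-bit patterns gives the bit string shown on the slide.
bits = "".join(f"{ord(ch):016b}" for ch in word)
print(bits)
```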

Arabic Input/Output

• Letter-based keyboard
• Logical order input
  – First-to-Last
• Visual display protocols
  – Right-to-Left
  – Letter shaping

س ل ا م   سلام
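A small sketch of what "logical order" means in practice: the string keeps the letters in the order they were typed, and the right-to-left layout and letter shaping are left to the display layer. The variable names are illustrative only.

```python
import unicodedata

typed = ["\u0633", "\u0644", "\u0627", "\u0645"]  # س ل ا م, typed first to last
word = "".join(typed)                              # stored string: سلام

for index, ch in enumerate(word):
    # Index 0 is the first letter typed, even though it is drawn rightmost.
    print(index, unicodedata.name(ch), unicodedata.bidirectional(ch))

# "AL" is the Unicode bidirectional class for Arabic letters; renderers use it
# to decide that this run is laid out right-to-left.
bidi_classes = [unicodedata.bidirectional(ch) for ch in word]
```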

Display Problems

[Figure: the headline تدشين منطقة حرة في دبي للتجارة الالكترونية ("Inauguration of a free zone in Dubai for e-commerce") rendered in a grid. Rows: actual encoding (CP-1256, ISO-8859, Unicode). Columns: display encoding (Western, Unicode, ISO-8859, CP-1256). The text is readable only where the display encoding matches the actual encoding; the other cells show garbled characters.]
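The display problems above are easy to reproduce: encode the text with one encoding and decode the bytes with another. This is a minimal sketch using Python's built-in codecs; the choice of codec pairs is an assumption made for illustration, not a reconstruction of the exact grid on the slide.

```python
headline = "تدشين منطقة حرة في دبي"

cp1256_bytes = headline.encode("cp1256")  # actual encoding: CP-1256
utf8_bytes = headline.encode("utf-8")     # actual encoding: Unicode (UTF-8)

# Matching actual and display encodings round-trips cleanly.
ok = cp1256_bytes.decode("cp1256")

# Mismatched encodings produce mojibake instead of an error.
as_western = cp1256_bytes.decode("latin-1")                   # shown as Western text
as_iso = cp1256_bytes.decode("iso8859_6", errors="replace")   # shown as ISO-8859-6
utf8_as_cp1256 = utf8_bytes.decode("cp1256", errors="replace")

for label, shown in [("match", ok), ("western", as_western),
                     ("iso-8859-6", as_iso), ("utf-8 as cp1256", utf8_as_cp1256)]:
    print(f"{label:>16}: {shown}")
```

Because none of these mismatches raises an exception by default, robust pipelines usually record the encoding alongside the data rather than guessing at display time.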

Arabic Script

• Arabic script uses a set of optional diacritics
  – Only 1.5% of words have at least one diacritic
  – Combinable
    • كَتَّب /kattab/ 'to dictate'

Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
Gemination: بّ /bb/
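Since the diacritics are separate combining characters, a tool can detect or remove them with a simple character class. The sketch below assumes the basic range U+064B to U+0652 (the vowel, nunation, gemination and sukun marks listed above); the helper names are hypothetical.

```python
import re

# Fathatan through sukun: the optional marks described on the slide.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return the text as it is usually written, without diacritics."""
    return DIACRITICS.sub("", text)


def has_diacritic(word):
    """True if the word carries at least one diacritic."""
    return DIACRITICS.search(word) is not None


def diacritized_share(words):
    """Fraction of words with at least one diacritic."""
    return sum(has_diacritic(w) for w in words) / len(words)


kattab = "\u0643\u064E\u062A\u0651\u064E\u0628"  # كَتَّب /kattab/
bare = strip_diacritics(kattab)                   # كتب
print(kattab, "->", bare)
```

Running `diacritized_share` over a tokenized corpus is one way to check a figure like the 1.5% quoted above on one's own data.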

اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب

مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.

(Spain denies freezing the aid granted to Morocco. Madrid 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not stopped the aid it provides to Morocco, contrary to what Moroccan Minister of Foreign Affairs and Cooperation Mohamed Benaissa asserted yesterday, Wednesday, before the Moroccan House of Representatives. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco has never stopped and has not been frozen.)

[Figure: the same passage shown a second time with the few diacritics it contains.]

Orthographic Ambiguity

• Arabic words can be very ambiguous due to optional diacritics
• But how ambiguous?
• Classic example
  ths s wht n rbc txt lks lk wth n vwls
  this is what an Arabic text looks like with no vowels
  – Not exactly true
    • Long vowels are always written
    • Initial vowels are represented by an ا 'Alif'
    • Some final short vowels are deterministically inferable
  ths is wht an Arbc txt lks lik wth no vwls
• For a computer…
  – A word on average has 12.3 analyses, 6.8 diacritizations, and 2.7 lemmas (core meanings)
• Not all of this ambiguity is due to orthography! More on this later.
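The classic English analogy can be generated mechanically. The sketch below drops every vowel, then applies only one of the three caveats (word-initial vowels are kept); modeling "long vowels" in English spelling is left out because it has no simple rule, so the second output is not identical to the slide's second line.

```python
import re

VOWELS = re.compile(r"[aeiou]", flags=re.IGNORECASE)


def drop_all_vowels(sentence):
    """Remove every vowel letter from every word."""
    return " ".join(VOWELS.sub("", word) for word in sentence.split())


def keep_initial_vowels(sentence):
    """Remove vowels except a word-initial one (the 'Alif' caveat)."""
    return " ".join(word[0] + VOWELS.sub("", word[1:]) for word in sentence.split())


sentence = "this is what an Arabic text looks like with no vowels"
no_vowels = drop_all_vowels(sentence)
initial_kept = keep_initial_vowels(sentence)
print(no_vowels)
print(initial_kept)
```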

Spelling Errors

• The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high share (30%) of words have errors in unedited Standard Arabic comments on Aljazeera.
• Arabic spelling errors are a big challenge to language technologies
  – GIGO: Garbage In Garbage Out
  – Errors in Standard Arabic
  – Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to allow for correcting errors or functioning well despite them.

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
  – Writing system
  – Word structure
  – Dialects
• State-of-the-art Arabic Technology
• Summary and Future Directions

Morphological Complexity

• Arabic is morphologically rich
  – A core word has many inflected forms
  – Example: Arabic verbs have 5,400 forms
    Gender (2), Number (3), Person (3), Aspect (3), Tense particle (2), Mood (3), Voice (2), Pronominal clitic (12), Conjunction clitic (3)

وسنقولها /wasanaqūluhā/
و+س+ن+قول+ها
wa+sa+na+qūl+u+hā
and+will+we+say+it
'And we will say it'

قال، قالت، قالا، قالوا، قلت، قلت، قلتما، قلتم، قلتن،
يقول، يقول، يقل، تقول، تقول، تقل، تقولين، تقولي،
... فقال، فقالت، فقالا ...، ... وسنقولها ... وسأقولها،
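As a rough sanity check on the size of the verb paradigm, one can multiply the feature counts listed on the slide. The product is only an upper bound; the quoted figure of 5,400 is lower, presumably because many feature combinations cannot co-occur (an assumption, since the slide does not show its derivation).

```python
from math import prod

# Feature sizes exactly as listed on the slide.
feature_sizes = {
    "gender": 2,
    "number": 3,
    "person": 3,
    "aspect": 3,
    "tense_particle": 2,
    "mood": 3,
    "voice": 2,
    "pronominal_clitic": 12,
    "conjunction_clitic": 3,
}

# Every combination of every feature value, valid or not.
upper_bound = prod(feature_sizes.values())
quoted_forms = 5400  # figure quoted on the slide

print(upper_bound, quoted_forms, round(quoted_forms / upper_bound, 3))
```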

Morphological Complexity

• English is not morphologically rich.
  – The number of inflected forms is small
  – The verb paradigm is limited to 6

    VB   VBD   VBG    VBN   VBP  VBZ
    go   went  going  gone  go   goes

  – The complete English part-of-speech tagset has 48 tags
  – The complete Arabic part-of-speech tagset has 22,400 tags

Morphological Ambiguity

• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
  – Optional diacritics
  – Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
  – Example: وبادلتها
    و+ب+أدلة+ها  'and with her pieces of evidence'
    و+بادلت+ها  'and I exchanged with her'
• Derivational ambiguity and homonymy
  العين  'the eye, the water spring, Al-Ain city, the notable'
  المحتل  'occupier, occupied'
  (العدو المحتل / الوطن المحتل / الدول المحتلة)
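One common response to the suboptimal spellings mentioned above is to fold the confusable letters to a single form before lookup. The sketch below is a minimal illustration of that idea under the two substitutions named on the slide; the function name and the optional flag are hypothetical, and folding makes matching more robust at the cost of adding ambiguity.

```python
# أ and إ are folded to bare ا; ة can optionally be folded to ه.
ALEF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627"})
TEH_MARBUTA_FOLD = str.maketrans({"\u0629": "\u0647"})


def normalize(text, fold_teh_marbuta=False):
    """Map spelling variants of Alif (and optionally Ta Marbuta) to one form."""
    text = text.translate(ALEF_FOLD)
    if fold_teh_marbuta:
        text = text.translate(TEH_MARBUTA_FOLD)
    return text


adilla = "\u0623\u062F\u0644\u0629"          # أدلة 'pieces of evidence'
folded = normalize(adilla)                    # ادلة
fully_folded = normalize(adilla, fold_teh_marbuta=True)  # ادله
print(adilla, folded, fully_folded)
```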

Analysis vs. Disambiguation

هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:
PV+PVSUFF_SUBJ:3MS | بين | He demonstrated
PV+PVSUFF_SUBJ:3FP | بين | They demonstrated (f.p)
NOUN_PROP | بين | Ben
ADJ | بين | Clear
PREP | بين | Between, among
PREP+NOUN_PROP | بين | In Yen

Morphological Analysis is out-of-context
Morphological Disambiguation is in-context
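The distinction can be sketched in code: analysis is a context-free lookup that returns every reading, and disambiguation picks one reading using the surrounding words. The rule below (prefer the proper-noun reading before a known surname) is a toy stand-in invented for illustration; real disambiguators rely on trained statistical models, and the surname list is hypothetical.

```python
WORD = "\u0628\u064A\u0646"  # بين

# Out-of-context analyses, copied from the slide.
LEXICON = {
    WORD: [
        ("PV+PVSUFF_SUBJ:3MS", "He demonstrated"),
        ("PV+PVSUFF_SUBJ:3FP", "They demonstrated (f.p)"),
        ("NOUN_PROP", "Ben"),
        ("ADJ", "Clear"),
        ("PREP", "Between, among"),
        ("PREP+NOUN_PROP", "In Yen"),
    ]
}


def analyze(word):
    """Morphological analysis: all readings of a word, ignoring context."""
    return LEXICON.get(word, [])


def disambiguate(tokens, index, known_surnames):
    """Toy in-context choice among the analyses of tokens[index]."""
    analyses = analyze(tokens[index])
    if not analyses:
        return None
    next_token = tokens[index + 1] if index + 1 < len(tokens) else None
    if next_token in known_surnames:
        for analysis in analyses:
            if analysis[0] == "NOUN_PROP":
                return analysis
    return analyses[0]  # fallback: first listed reading


sentence = ["هل", "سينجح", WORD, "أفليك", "في", "دور", "باتمان"]
choice = disambiguate(sentence, 2, known_surnames={"أفليك"})
print(len(analyze(WORD)), choice)
```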


Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
  – Writing system
  – Word structure
  – Dialects
• State-of-the-art Arabic Technology
• Summary and Future Directions

Arabic and its Dialects

• Arabic has ~360M speakers
• Forms of Arabic
  – Classical Arabic (CA)
    • Classical historical texts
    • Liturgical texts
  – Modern Standard Arabic (MSA)
    • News media & formal speeches and settings
    • Only written standard
  – Dialectal Arabic (DA)
    • Predominantly spoken vernaculars
    • No written standards
• Diglossia
  – Two forms of the language exist side by side

Arabic and its Dialects

• Official language: Modern Standard Arabic (MSA)
  – No one's native language
• Regional Dialects
  – Egyptian Arabic (EGY)
  – Levantine Arabic (LEV)
  – Gulf Arabic (GLF)
  – North African Arabic (NOR): Moroccan, Algerian, Tunisian
  – Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
  – City, Rural, Bedouin

Dialects or Languages?

• The arguments from power
  – "A language is a dialect with an army".
  – Religion, nationalism, regionalism, identity
• The arguments from linguistic difference
  – Degrees of mutual intelligibility
  – "The eager communicator" and "the eavesdrop test"
• The view from language technology
  – This question is irrelevant.
  – How do we model Arabic as a diglossic system?
    • Variations and their function
  – How do we model how Arabs communicate?
    • Behavior and expectation
  – Can we exploit similarities among dialects and between MSA and dialects to build better systems?
    • Technology and power

Phonological Variations

• Major variants

Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /ð/ | /ð/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /ʒ/

Spelling Inconsistency

Egyptian Arabic word مابيقولهاش /mabiʔulhāʃ/ "he does not say it"

مابيقولهاش، مبيقولهاش، مابقولهاش، مبقولهاش، مابيقلهاش، مبيقلهاش، مابقلهاش، مبقلهاش، مابيئولهاش، مبيئولهاش،
مابئولهاش، مبئولهاش، مابيئلهاش، مبيئلهاش، مابئلهاش، مبئلهاش، مابيؤلهاش، مبيؤلهاش، مابؤلهاش، مبؤلهاش

Mabe'ulhash, Mabi'ulhash, Mabequlhash, Mabiqulhash, Mabeulhash, Mabiulhash, Mabe'ulhach, Mabi'ulhach, Mabequlhach, Mabiqulhach, …

If there is no standard, can a word be misspelled?
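The Latin-script variants above follow from a few independent spelling choices, so their number grows multiplicatively. The sketch below assumes three choice points read off the listed forms (the prefix vowel, how the glottal stop is written, and the final consonant); the variable names are illustrative.

```python
from itertools import product

prefix_vowels = ["e", "i"]       # Mabe... vs. Mabi...
glottal_stops = ["'", "q", ""]   # apostrophe, etymological q, or nothing
finals = ["sh", "ch"]            # two ways to write the final consonant

# The rightmost argument varies fastest, which matches the order on the slide.
variants = [
    f"Mab{vowel}{stop}ulha{final}"
    for final, stop, vowel in product(finals, glottal_stops, prefix_vowels)
]

print(len(variants))
print(", ".join(variants))
```

Three small choices already give twelve spellings of one word; the Arabic-script list has the same combinatorial character, which is part of the motivation for a conventional orthography.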

Lexical Variation

English | Table | Cat | Of | I want | He will write | There isn't
MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | uridu اريد | sayaktubu سيكتب | lā yujadu لا يوجد
Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | γajekteb غيكتب | mākāynš ماكاينش
Egyptian | Tarabēza طربيزة | oTTa قطة | bitāς بتاع | ςāwez عاوز | hayiktib هيكتب | mafīš مفيش
Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | Hayoktob حيكتب | māfi مافي
Iraqi | mēz ميز | bazzūna بزونة | māl مال | arīd اريد | raH yiktib رح يكتب | māku ماكو

Lexical Variation

• براد  EGY: kettle; LEV: fridge
• مرا  EGY: prostitute; LEV: woman
• ماشي  EGY/LEV: okay; MOR: not
• بسط  EGY/LEV: make happy; IRQ: beat up
• بلش  LEV: start; SUD: end

Morphological Variation

• Some aspects of words are simplified in the dialects
  – Loss of case marking
    كتابُ، كتابَ، كتابِ، كتابًا، كتابٌ، كتابٍ → كتاب
  – Consolidation of masculine and feminine plurals
    يكتبوا، يكتبون، يكتبن → يكتبوا، يكتبون
  – Loss of some dual forms
    يكتبان، يكتبا → يكتبوا، يكتبون
• Other aspects increase in complexity!

Morphological Variation: Verb Morphology

Morpheme labels: conj, verb, object, subj, tense, IOBJ, neg, neg

MSA  ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him

EGY  وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not

'And you didn't write it for him'
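The segmentations and glosses above line up one-to-one, which makes the structural difference easy to inspect programmatically. The sketch below pairs each morpheme with its gloss; the helper is hypothetical and simply splits on "+" and whitespace as the slide's notation does.

```python
import re


def align(segmented, gloss):
    """Pair each morpheme with its gloss; both use '+' or spaces as separators."""
    morphemes = re.split(r"[+\s]+", segmented.strip())
    glosses = re.split(r"[+\s]+", gloss.strip())
    if len(morphemes) != len(glosses):
        raise ValueError("segmentation and gloss have different lengths")
    return list(zip(morphemes, glosses))


msa = align("wa+lam taktubū+hā la+hu", "and+not_past write_you+it for+him")
egy = align("wi+ma+katab+tu+ha+lū+ʃ", "and+not+wrote+you+it+for_him+not")

# Egyptian marks negation twice, around the verb (ma ... ʃ).
egy_negation = [morpheme for morpheme, meaning in egy if meaning == "not"]

for morpheme, meaning in egy:
    print(f"{morpheme:>6}  {meaning}")
```

The MSA form spreads the same content over three orthographic words, while the Egyptian form packs seven morphemes, including the two-part negation, into one.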

Why Work on Arabic Dialects?

• Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
  – Speech recognition systems must model dialects
• Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
  – Text analytics of Arabic must include dialectal modeling
• Substantial Dialect-MSA differences impede direct application of MSA NLP tools
  – 36% of Egyptian words are not recognizable using MSA analyzers (Habash et al., 2012)

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions

Comparing Performance

• Part-of-Speech Tagging and Syntax Parsing
  Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
  – Large gap between English and Arabic; and between Standard Arabic and Arabic dialects
  – More resources and more research efforts for English compared to Arabic.

English   Standard Arabic   Egyptian Arabic

Full Part-of-Speech   97.6%   85.4%   75.5%

Core Part-of-Speech   96.1%   91.1%

Dependency Syntax   92.2%   86.2%

Comparing Performance

• Machine Translation
  – Quality of machine translation from MSA is much better than in the dialects
  – The main reason is availability of parallel corpora
    • 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)

Variety | Arabic Source Text | Google Translate (Nov 12, 2016)
MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
EGY | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electricity, why like Bs?
LEV | شكلو مفيش كهربا، ليش هيك؟ | Joined Mafeesh looks like it, Why the heck?
IRQ | شو ماكو كهرباء، خير؟ | Shaw Maku electricity, okay?

Resources: Linguistic Data Consortium

• All Arabic resources compared to English resources went from 3.6% in 2000 to 35% in 2016
• Arabic dialect resources account for 21% of all Arabic resources
• These numbers do not cover all resources, but are fairly representative.

[Figure: chart of Linguistic Data Consortium resource counts (y-axis 0 to 450) for Arabic Dialects, All Arabic, and English.]

Publications: Google Scholar, Natural Language Processing

Publications on Arabic dialect, All Arabic and English

[Figure: number of publications per year, 2000 to 2016 (y-axis 0 to 120,000), for Arabic Dialect NLP, Arabic NLP, and English NLP.]

• On average publications on Arabic are equal to 6% of publications on English
• Arabic dialects publications went from 21% of all Arabic publications in 2000 to 50% in 2016 (overall 37%)

[Figure: the same publication counts for Arabic Dialect NLP, Arabic NLP, and English NLP, 2000 to 2016, replotted on two y-axes (0 to 120,000 and 0 to 8,000).]

Publications: Google Scholar, Natural Language Processing

Publications on a Number of Languages

Language | Publications since 2000
English | 107,930
French | 17,700
Chinese | 17,500
German | 16,800
Spanish | 15,600
Arabic | 7,019
Arabic Dialects | 2,595

All numbers are over publications in English

• Many languages lag behind English
• The number of German native speakers is less than a third of the number of Arabic native speakers, but German has over twice the publication count of Arabic
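The claims attached to the publication counts can be checked with a few divisions. The sketch below uses only the totals from the table; note that the "6% on average" figure on the earlier slide is a per-year average, so the overall ratio computed here is close to it but not identical.

```python
# Publications since 2000, copied from the table.
publications = {
    "English": 107_930,
    "French": 17_700,
    "Chinese": 17_500,
    "German": 16_800,
    "Spanish": 15_600,
    "Arabic": 7_019,
    "Arabic Dialects": 2_595,
}

german_vs_arabic = publications["German"] / publications["Arabic"]
arabic_vs_english_pct = 100 * publications["Arabic"] / publications["English"]
dialect_share_pct = 100 * publications["Arabic Dialects"] / publications["Arabic"]

print(f"German / Arabic:            {german_vs_arabic:.2f}x")
print(f"Arabic as % of English:     {arabic_vs_english_pct:.1f}%")
print(f"Dialects as % of all Arabic: {dialect_share_pct:.0f}%")
```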

Computational Processing of Standard and Dialectal Arabic

• There has been a growing amount of work on Arabic processing
  – Multiple morphological analyzers, taggers and automatic annotation tools
    • BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, CALIMA, etc.
    • AIDA (dialect identification), 3arrib (Arabizi-to-Arabic), etc.
  – Multiple treebanks and parsers
    • Penn ATB, CATiB, Quran Corpus, ARZ-TB, Stanford parser, Camel parser, etc.
  – Large collections of monolingual text with or without annotations
    • Gigaword, news collections, QALB, YADAC, Curras, Gumar, etc.
  – Large collections of bilingual/multilingual text and lexicons
    • UN corpus, news collections, multi-dialect corpora, Tharwa, Arab-Acquis, etc.
  – Sentiment resources
    • ArSenL, SLSA, SAMAR, etc.
  – Not to mention the traditional resources on lexicography, morphology and syntax!
• In general more is done on Standard Arabic than the dialects.

Examples

• Some examples of ongoing projects on Arabic language processing
  – Conventional Orthography for Dialectal Arabic
  – MADAMIRA Arabic tagger
  – Gumar Project
  – SAMER Project
  – MADAR Project

CODA: A Conventional Orthography for Dialectal Arabic

• Developed for computational processing purposes (Habash et al., 2012)
• Objectives
  – CODA covers all Arabic dialects in principle
  – CODA minimizes differences in choices
  – CODA is easy to learn and produce consistently
  – CODA is intuitive to readers unfamiliar with it
  – CODA uses Arabic script
• Current manuals for Egyptian, Tunisian, Levantine, Algerian, and Gulf

CODA Examples

CODA: ماشفتش صحابي الفترة اللي قبل الامتحانات

CODA word | Gloss | Spelling variants
ماشفتش | I did not see | ما شفتش، مشفتش، ماشوفتش، ما شوفتش، مشوفتش، mashoftish
صحابي | my friends | صحابى، صوحابي، صوحابى، Su7abi, sohaby
الفترة | the period | الفتره، الفطرة، الفطره، ilftra
اللي | which | اللى، إللي، إللى، الي، الى، إلي، إلى، illi
قبل | before | أبل، ابل، abl, qbl, qabl
الامتحانات | the exams | الإمتحانات، المتحانات، الامتحنات، الإمتحنات، المتحنات، ilimti7anat, limtihanaat

MADAMIRA

• State-of-the-art Arabic and Arabic Dialect processing tool (Pasha et al., 2014)
  – Collaborative effort
    • Columbia University (Rambow)
    • George Washington University (Diab)
    • New York University Abu Dhabi (Habash)
  – Morphological disambiguation
  – Tokenization
  – Base phrase chunking
  – Named entity recognition
• MSA and Egyptian Arabic modes
• Server-mode with XML interface

[Figure: pipeline. Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications]

MADAMIRA

http://camel.abudhabi.nyu.edu/madamira/


MADAMIRA Morphological Disambiguation

System | MSA | MSA | EGY
Test | MSA | EGY | EGY
Full Analysis | 84.3% | 27.0% | 75.4%
Diacritization | 86.4% | 32.2% | 83.2%
Lemmatization | 96.1% | 67.1% | 86.3%
Base POS-tagging | 96.1% | 82.1% | 91.1%
ATB Segmentation | 99.1% | 90.5% | 97.4%

Example: وكاتبه (wkAtbh) 'and his writer'

wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms

w+ kAtb +h
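The analysis line shown for وكاتبه is a flat list of `key:value` features, which is straightforward to load into a dictionary. The sketch below parses that one line exactly as it appears on the slide; it is not MADAMIRA's API or its actual output format, and the assumption that the first two fields are the diacritized form and the lemma comes from reading the example.

```python
analysis_line = (
    "wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 "
    "per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)

# Assumed layout: diacritized form, lemma, then key:value feature pairs.
diacritized, lemma, *pairs = analysis_line.split()
features = dict(pair.split(":", 1) for pair in pairs)

# Clitic slots with a value other than "0" are actually attached to the word.
attached_clitics = {
    key: value
    for key, value in features.items()
    if key.startswith(("prc", "enc")) and value != "0"
}

print(diacritized, lemma, features["pos"])
print(attached_clitics)
```

Here the two attached clitics correspond to the w+ and +h pieces of the segmentation w+ kAtb +h.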

The Gumar Corpus

• 100 million words of mostly Gulf Arabic conversational novels published anonymously online (روايات النت 'Internet novels') (Khalifa et al., 2016)
• NYUAD REF funded to annotate 200K words manually (Habash).
  – We are hiring annotators!
• Gumar Corpus Browser: http://camel.abudhabi.nyu.edu/gumar/

SAMER Project

• Simplification of Arabic Masterpieces for Extensive Reading
  – Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
  – NYUAD REF funding for two years (start Sep 2016)
• Objectives
  – Create a standard for the simplification of modern fiction in Arabic to school-age learners.
  – Develop a tool for automating readability scale grading for Arabic
  – Simplify a number of Arabic fiction masterpieces
    • Public competition

MADAR Project

• Multi-Arabic Dialect Applications and Resources
• Funded by the Qatar National Research Fund
• Collaboration among CMUQ, NYUAD and Columbia
  – Nizar Habash, Houda Bouamor, Kemal Oflazer and Owen Rambow
• Modeling 25 Arabic city dialects
  – Lexical resources, parallel data, dialect identification, and dialect machine translation
• Looking for linguists!

First Workshop on Arabic Dialect Technologies

• Sponsored by the NYUAD Institute
• A research un-conference
• 30 leading researchers on Arabic computational linguistics
• Discuss the state of the field and plan its future
• http://wardat2016.arabic-nlp.net/

CAMeL Lab

• Computational Approaches to Modeling Language
• A new NLP/CL lab at NYU Abu Dhabi
  – Arabic core NLP (morphology, syntax)
  – Arabic dialect modeling
  – Machine translation
  – Information retrieval
• We are hiring!!
  – Research Scientists, Postdocs, Research Assistants
• [email protected]
• http://www.camel-lab.com

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions

Summary

• Arabic poses many challenges to language technologies
  – Orthographic ambiguity
    • Under-specification and inconsistency
  – Morphological complexity
    • Rich and complex system of features
  – Enormous variety
    • Many dialects and sub-dialects, code switching
  – Annotated resource poverty
• There has been a lot of work on Arabic and Arabic dialect technologies.
  – But the current performance levels are not acceptable.

Future Directions

• The field of Arabic language technologies needs a lot of support to keep growing.
  – More researchers and developers
    • Computational linguistics university programs
  – More resource and knowledge sharing
    • Open source, non-commercial license models
    • More conferences for Arabic language technologies
    • Coordination of technological standards
  – More funding to support academic researchers and startups
    • Build more resources and more tools
    • Encourage collaborations across universities and among universities and companies

Future Directions

• A culture around Arabic that expects its language technology to be
  – High quality
    • Robust, human-quality
  – Seamless
    • Well integrated in other applications
  – Supportive
    • Language technology to help the blind or hearing impaired
    • Language technology for pedagogy
  – Personalizable
    • Understand many dialects, speak many dialects

What to expect (linguistically) from a robot?

- حافظ على سرعه، لا أريد مشاكل مع السلطات
- مفهوم
- لا لا
إسمع كيف يتكلم الناس
لا تقول مفهوم او أى شىء مثلها
قل لا توجد مشكله
لو تعدى عليك أحد قل إلتهمنى

- Keep it under 65. We don't want to be pulled over.
- Affirmative.
- No, no, no. You got to listen to the way people talk. You don't say "affirmative" or some shit like that. You say "No problemo". If someone comes up to you with an attitude, you say "Eat me".

Terminator 2: John Connor and T-800

Acknowledgements

• Columbia University (Rambow, Eskander, Alkholy, Salloum, Alfardy, Altantawy)
• The George Washington University (Diab, Hawwari, Badrashiny)
• Carnegie Mellon University Qatar (Bouamor, Zaghouani, Obied, Oflazer)
• Birzeit University (Jarrar, Rimawi)
• American University of Beirut (Hajj, Baly, Badaro)
• University of Bahrain (Abdulrahim)
• New York University Abu Dhabi (Al Khalil, Soulamani)
• And the NYUAD CAMeLeers (Shahrour, Khalifa, Taji, Hasan, Zalmout, Saddiki, Erdmann)

• http://nyuad.nyu.edu/en/

Thank You! Questions?