Language Technologies for Arabic and its Dialects
NYUAD Institute Talk, November 15, 2016
Prof. Nizar Habash, New York University Abu Dhabi
NYUAD CAMeL Lab
تقنيات اللغة العربية ولهجاتها
محاضرة معهد جامعة نيويورك أبوظبي
د. نزار حبش، جامعة نيويورك أبوظبي
١٥-١١-٢٠١٦
Roadmap
• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions
Language Technologies
• Also known as
  – Natural Language Processing
  – Computational Linguistics
  – Human Language Technology
• Language technology is an interdisciplinary field
  – Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest applications of computer science
  – Cryptography
  – Machine Translation
Language Technologies
• Applications
  – Information retrieval
  – Machine translation
  – Automatic speech recognition & speech synthesis
  – Sentiment and emotion analysis
  – Dialogue systems & chatting agents
  – Optical character recognition
  – Automatic summarization, etc.
• Enabling technologies
  – Tokenization
  – Part-of-speech tagging
  – Syntactic parsing
  – Lemmatization
  – Word sense disambiguation, etc.
Paradigms for Language Technologies
• Rule-based Approaches
  – Linguists write rules that are applied by the machines
• Machine Learning Approaches
  – Corpus-based, Statistical Approaches
  – Machines learn the "rules" from training data
• Machine learning approaches are dominant in the field
What do we need to help machines learn?
• Data, data and more data!
• Specifically, annotated data

Application                     Annotated Data Example
Machine Translation             Parallel corpus in two languages, e.g., the UN corpus with English, Arabic, Chinese, Spanish, Russian, French
Sentiment Analysis              A corpus of tweets with tags indicating: positive, negative, neutral
Speech Recognition              A corpus of audio files with their corresponding transcripts
Optical Character Recognition   A corpus of scanned book page images and their corresponding transcripts
Part-of-Speech                  An English corpus with part-of-speech indicated for each word
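To make "machines learn the rules from annotated data" concrete, here is a minimal sketch of a most-frequent-tag baseline for part-of-speech tagging. The three-sentence corpus, the tag choices and the function names are hypothetical and only illustrate the idea; real taggers use far larger corpora and context-aware models.

```python
from collections import Counter, defaultdict

# Toy annotated corpus (hypothetical): each token is paired with a part-of-speech tag.
TRAINING_DATA = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("were", "VBD"), ("long", "JJ")],
    [("a", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("fast", "RB")],
]


def train(corpus):
    """Count how often each word carries each tag; keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(model, words, default="NN"):
    """Tag each word with its most frequent training tag; unseen words get the default."""
    return [(w, model.get(w, default)) for w in words]


model = train(TRAINING_DATA)
print(tag_words(model, ["the", "dog", "runs"]))
```

No rule was written by hand here: the word "runs" is tagged as a verb only because that reading is more frequent in the annotated examples, which is also why the size, domain and quality of the annotated data matter so much.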
Challenges for Machine Learning Language Technologies
• Size of training data
  – More is better!
• Domain and genre sensitivity
  – Systems trained on news do not do well on novels
• Quality of annotations
  – Why expect good performance if humans do not agree with each other on the task?
• Developing robust algorithms for machine learning is essential
Machine Learning vs. Human Learning
• Humans: predisposed for acquiring language
• Machines: not so!
Roadmap
• On Language Technologies
• Arabic from a Technical Perspective
  – Writing system
  – Word structure
  – Dialects
• State-of-the-art Arabic Technology
• Summary and Future Directions
Arabic Script (الخط العربي)
• An abjad (consonantal alphabet with diacritics)
• Written right-to-left
• Letters have contextual variants
• Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.
Unicode
• The international encoding standard
• Widely supported input and display
• Supports extended Arabic characters
• Multi-script representation

الحب  0000011000100111 0000011001000100 0000011000101101 0000011000101000
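The bit string on the slide is the sequence of 16-bit Unicode code points of the four letters of الحب. A short sketch that reproduces it (the variable names are illustrative only):

```python
# The four letters of الحب, written as escapes so the order is unambiguous:
# alef (U+0627), lam (U+0644), hah (U+062D), beh (U+0628).
word = "\u0627\u0644\u062D\u0628"

# Print each code point in hexadecimal and as a 16-bit binary number.
for ch in word:
    print(f"U+{ord(ch):04X}  {ord(ch):016b}")

# Concatenating the 16-bit values gives the bit string shown on the slide.
bits = "".join(f"{ord(ch):016b}" for ch in word)
print(bits)

# The same 16-bit units are what UTF-16 (big-endian) stores; UTF-8 uses different bytes.
print(word.encode("utf-16-be").hex())
print(word.encode("utf-8").hex())
```

The characters are stored in logical (first-to-last) order; the right-to-left arrangement and the letter shaping are applied only at display time.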
Arabic Input/Output
• Letter-based keyboard
• Logical order input
  – First-to-Last
• Visual display protocols
  – Right-to-Left
  – Letter shaping

س ل ا م → سلام
Display Problems
[Figure: the headline "تدشين منطقة حرة في دبي للتجارة الالكترونية" ("Launch of a free zone in Dubai for e-commerce") rendered under every combination of actual encoding (Unicode, ISO-8859, CP-1256) and display encoding (Western, Unicode, ISO-8859, CP-1256). The text is legible only where the display encoding matches the actual encoding; the other cells show garbled characters.]
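The display problems above come from decoding bytes with a different encoding than the one used to write them. A minimal sketch using Python's built-in codecs, where the sample word and variable names are chosen only for illustration:

```python
text = "\u062F\u0628\u064A"  # دبي (Dubai)

# Bytes as a CP-1256 (Windows Arabic) application would store them.
stored = text.encode("cp1256")

# Decoding with the matching encoding recovers the text.
right = stored.decode("cp1256")

# Decoding the same bytes as Western (ISO-8859-1) yields accented Latin letters.
wrong_western = stored.decode("latin-1")

# UTF-8 bytes shown as CP-1256: every Arabic letter turns into two unrelated symbols.
wrong_cp1256 = text.encode("utf-8").decode("cp1256", errors="replace")

print(right, wrong_western, wrong_cp1256)

# The damage is reversible as long as no bytes were lost along the way.
repaired = wrong_western.encode("latin-1").decode("cp1256")
print(repaired == text)
```

Nothing is wrong with the stored bytes in these cases; only the interpretation is wrong, which is why declaring the encoding (or using Unicode throughout) removes this class of problem.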
Arabic Script
• Arabic script uses a set of optional diacritics
  – Only 1.5% of words have at least one diacritic
  – Combinable, e.g., كَتَّبَ /kattab/ "to dictate"

Vowel          Nunation       Gemination
بَ /ba/        بً /ban/       بّ /bb/
بُ /bu/        بٌ /bun/
بِ /bi/        بٍ /bin/
بْ /b/
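Because the diacritics are optional, most text reaches a system without them, and several different diacritized words collapse into one written form. A small sketch of that collapse, assuming the diacritics of interest are the eight marks U+064B to U+0652; the helper name is hypothetical:

```python
import re

# Arabic diacritics: tanween (U+064B-U+064D), short vowels (U+064E-U+0650),
# shadda (U+0651) and sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return the text as it is usually written, without optional diacritics."""
    return DIACRITICS.sub("", text)


# Three different words, written with escapes to keep the marks explicit.
kattaba = "\u0643\u064E\u062A\u0651\u064E\u0628\u064E"  # kattaba, 'to dictate'
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"         # kataba, 'he wrote'
kutub = "\u0643\u064F\u062A\u064F\u0628"                # kutub, 'books'

bare_forms = {strip_diacritics(w) for w in (kattaba, kataba, kutub)}
print(bare_forms)  # a single undiacritized form
```

Going in the other direction, from the bare form back to the intended diacritized word, is the hard part and needs context.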
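Example (news text as normally written, almost entirely without diacritics):
اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب
مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.
English: Spain denies freezing the aid granted to Morocco. Madrid, 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not stopped the aid it provides to Morocco, contrary to what the Moroccan Minister of Foreign Affairs and Cooperation, Mohamed Benaissa, asserted yesterday, Wednesday, before the Moroccan House of Representatives. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco has never stopped and has not been frozen.
[Second copy of the same text, showing the few diacritics that appear in it.]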
Orthographic Ambiguity
• Arabic words can be very ambiguous due to optional diacritics
• But how ambiguous?
• Classic example
  ths s wht n rbc txt lks lk wth n vwls
  (this is what an Arabic text looks like with no vowels)
  – Not exactly true
    • Long vowels are always written
    • Initial vowels are represented by an ا 'Alif'
    • Some final short vowels are deterministically inferable
  ths is wht an Arbc txt lks lik wth no vwls
• For a computer…
  – A word on average has 12.3 analyses, 6.8 diacritizations, and 2.7 lemmas (core meanings)
  – Not all of this ambiguity is due to orthography! More on this later.
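The classic English illustration can be reproduced, up to spacing, by simply deleting the vowels. The sketch below also groups a few English words by their vowel-less skeleton to show where the ambiguity comes from; the function name and word list are illustrative only.

```python
from collections import defaultdict


def drop_vowels(sentence):
    """Remove every a/e/i/o/u, mimicking a script that writes no vowels at all."""
    words = ("".join(ch for ch in w if ch not in "aeiou") for w in sentence.lower().split())
    return " ".join(w for w in words if w)


sentence = "this is what an Arabic text looks like with no vowels"
print(drop_vowels(sentence))

# Several distinct words share one skeleton, so the reader (or machine) must guess.
skeletons = defaultdict(list)
for w in ["bat", "bet", "bit", "but", "boat", "bed", "bad"]:
    skeletons[drop_vowels(w)].append(w)
print(dict(skeletons))
```

As the slide notes, real Arabic orthography is less extreme than this, since long vowels and initial vowels are written.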
Spelling Errors
• The Qatar Arabic Language Bank (QALB, PI Habash) project found a very high error rate: 30% of words have errors in unedited Standard Arabic comments on Aljazeera.
• Arabic spelling errors are a big challenge to language technologies
  – GIGO: Garbage In, Garbage Out
  – Errors in Standard Arabic
  – Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to allow for correcting errors or functioning well despite them.
Morphological Complexity
• Arabic is morphologically rich
  – A core word has many inflected forms
  – Example: Arabic verbs have 5,400 forms
    Gender (2), Number (3), Person (3), Aspect (3), Tense particle (2), Mood (3), Voice (2), Pronominal clitic (12), Conjunction clitic (3)

وسنقولها /wasanaqūluhā/
و+س+ن+قول+ها
wa+sa+na+qūl+u+hā
and+will+we+say+it
"And we will say it"

قال، قالت، قالا، قالوا، قلت، قلت، قلتما، قلتم، قلتن، يقول، يقول، يقل، تقول، تقول، تقل، تقولين، تقولي، ... فقال، فقالت، فقالا، ... وسنقولها، وسأقولها، ...
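The single written word وسنقولها packs a whole English clause. A small sketch that represents the slide's segmentation as data and rebuilds the surface form and the gloss from it; the list layout is an illustration, not the output format of any particular analyzer, and treating "u" as an unglossed mood suffix is an assumption based on the slide giving it no separate gloss.

```python
# Morphemes of وسنقولها in transliteration, paired with the slide's glosses.
# The "u" suffix has no gloss of its own on the slide, so it is left empty here.
MORPHEMES = [
    ("wa", "and"),
    ("sa", "will"),
    ("na", "we"),
    ("qūl", "say"),
    ("u", ""),
    ("hā", "it"),
]

# Joining the morphemes gives the pronounced word and its segmented form.
surface = "".join(form for form, _ in MORPHEMES)
segmented = "+".join(form for form, _ in MORPHEMES)

# Keeping only the glossed morphemes gives the morpheme-by-morpheme translation.
gloss = "+".join(g for _, g in MORPHEMES if g)

print(surface)
print(segmented)
print(gloss)
```

A tokenizer for Arabic has to undo this packing, which is why tokenization and morphological analysis are treated as core enabling technologies.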
Morphological Complexity
• English is not morphologically rich.
  – The number of inflected forms is small
  – The verb paradigm is limited to 6 forms

    VB   VBD   VBG    VBN   VBP  VBZ
    go   went  going  gone  go   goes

  – The complete English part-of-speech tag set has 48 tags
  – The complete Arabic part-of-speech tag set has 22,400 tags
Morphological Ambiguity
• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
  – Optional diacritics
  – Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
  – Example: وبادلتها
    و+ب+أدلة+ها  "and with her pieces of evidence"
    و+بادلت+ها  "and I exchanged with her"
• Derivational ambiguity and homonymy
  – العين: the eye, the water spring, Al-Ain city, the notable
  – المحتل: occupier, occupied (العدو المحتل / الوطن المحتل / الدول المحتلة)
Analysis vs. Disambiguation

هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:
PV+PVSUFF_SUBJ:3MS    He demonstrated
PV+PVSUFF_SUBJ:3FP    They demonstrated (f.p.)
NOUN_PROP             Ben
ADJ                   Clear
PREP                  Between, among
PREP+NOUN_PROP        In Yen

• Morphological analysis is out-of-context
• Morphological disambiguation is in-context
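The difference between the two tasks can be sketched in a few lines. The analyses below are the six readings of بين listed on the slide; the disambiguation rule (prefer the proper-noun reading before a known surname) is a made-up toy, whereas real disambiguators rank analyses with statistical models trained on annotated text.

```python
# Out-of-context analyses of the word بين, copied from the slide.
ANALYSES = {
    "بين": [
        ("PV+PVSUFF_SUBJ:3MS", "He demonstrated"),
        ("PV+PVSUFF_SUBJ:3FP", "They demonstrated (f.p.)"),
        ("NOUN_PROP", "Ben"),
        ("ADJ", "Clear"),
        ("PREP", "Between, among"),
        ("PREP+NOUN_PROP", "In Yen"),
    ]
}


def analyze(word):
    """Morphological analysis: return every possible reading, ignoring context."""
    return ANALYSES.get(word, [])


def disambiguate(word, next_word, known_surnames):
    """Toy in-context choice (hypothetical rule): pick the proper-noun reading
    when the next word is a known surname, otherwise fall back to the first reading."""
    readings = analyze(word)
    if next_word in known_surnames:
        for tag, gloss in readings:
            if tag == "NOUN_PROP":
                return tag, gloss
    return readings[0] if readings else None


print(len(analyze("بين")))
print(disambiguate("بين", "أفليك", {"أفليك"}))
```

Analysis alone returns all six readings; only the neighbouring word أفليك (Affleck) makes "Ben" the right choice in the example sentence.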
Arabic and its Dialects
• Arabic has ~360M speakers
• Forms of Arabic
  – Classical Arabic (CA)
    • Classical historical texts
    • Liturgical texts
  – Modern Standard Arabic (MSA)
    • News media & formal speeches and settings
    • Only written standard
  – Dialectal Arabic (DA)
    • Predominantly spoken vernaculars
    • No written standards
• Diglossia
  – Two forms of the language exist side by side
Arabic and its Dialects
• Official language: Modern Standard Arabic (MSA)
  – No one's native language
• Regional dialects
  – Egyptian Arabic (EGY)
  – Levantine Arabic (LEV)
  – Gulf Arabic (GLF)
  – North African Arabic (NOR): Moroccan, Algerian, Tunisian
  – Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
  – City, Rural, Bedouin
Dialects or Languages?
• The arguments from power
  – "A language is a dialect with an army".
  – Religion, nationalism, regionalism, identity
• The arguments from linguistic difference
  – Degrees of mutual intelligibility
  – "The eager communicator" and "the eavesdrop test"
• The view from language technology
  – This question is irrelevant.
  – How do we model Arabic as a diglossic system?
    • Variations and their function
  – How do we model how Arabs communicate?
    • Behavior and expectation
  – Can we exploit similarities among dialects and between MSA and dialects to build better systems?
    • Technology and power
Phonological Variations
• Major variants

Letter   MSA    Dialects
ق       /q/    /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث       /θ/    /θ/, /t/, /s/
ذ       /ð/    /ð/, /d/, /z/
ج       /ʤ/    /ʤ/, /g/, /ʒ/
Spelling Inconsistency
Egyptian Arabic word مابيقولهاش /mabiʔulhāʃ/ "he does not say it"

Arabic-script variants:
مابيقولهاش، مبيقولهاش، مابقولهاش، مبقولهاش، مابيقلهاش، مبيقلهاش، مابقلهاش، مبقلهاش، مابيئولهاش، مبيئولهاش، مابيئلهاش، مبيئلهاش، مابئلهاش، مبئلهاش، مابيؤلهاش، مبيؤلهاش، مابؤلهاش، مبؤلهاش، مابئولهاش، مبئولهاش

Latin-script variants:
Mabe’ulhash, Mabi’ulhash, Mabequlhash, Mabiqulhash, Mabeulhash, Mabiulhash, Mabe’ulhach, Mabi’ulhach, Mabequlhach, Mabiqulhach, …

If there is no standard, can a word be misspelled?
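The number of spellings grows multiplicatively with every point where writers make an independent choice. A sketch that regenerates the Latin-script variants listed on the slide from three such choice points; the decomposition into choice points is an illustration inferred from the listed forms, not a claim about how the slide's list was produced.

```python
from itertools import product

APOSTROPHE = "\u2019"  # the typographic apostrophe used in the slide's examples

# Each inner list is one point where writers choose freely.
CHOICES = [
    ["Mab"],
    ["e", "i"],              # how the short vowel is written
    [APOSTROPHE, "q", ""],   # how the glottal stop (etymological qaf) is written
    ["ulha"],
    ["sh", "ch"],            # how the final /ʃ/ is written
]

# Every combination of choices is a plausible spelling of the same word.
variants = ["".join(parts) for parts in product(*CHOICES)]
print(len(variants))
print(variants)
```

Three small choices already give 2 × 3 × 2 = 12 spellings of one word, so systems for dialectal text cannot rely on exact string matching.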
Lexical Variation

English    Table              Cat              Of            I want        He will write        There isn't
MSA        Tāwila طاولة       qiTTa قطة        idafa Ø       uridu اريد    sayaktubu سيكتب      lā yujadu لا يوجد
Moroccan   mida ميدة          qeTTa قطة        dyāl ديال     bγīt بغيت     γajekteb غيكتب       mākāynš ماكاينش
Egyptian   Tarabēza طربيزة    oTTa قطة         bitāς بتاع    ςāwez عاوز    hayiktib هيكتب       mafīš مفيش
Syrian     Tāwle طاولة        bisse بسة        tabaς تبع     biddi بدي     Hayoktob حيكتب       māfi مافي
Iraqi      mēz ميز            bazzūna بزونة    māl مال       arīd اريد     raH yiktib رح يكتب   māku ماكو
Lexical Variation
• براد  EGY: kettle; LEV: fridge
• مرا  EGY: prostitute; LEV: woman
• ماشي  EGY/LEV: okay; MOR: not
• بسط  EGY/LEV: make happy; IRQ: beat up
• بلش  LEV: start; SUD: end
Morphological Variation
• Some aspects of words are simplified in the dialects
  – Loss of case marking
    كتابُ، كتابَ، كتابِ، كتابٌ، كتابًا، كتابٍ → كتاب
  – Consolidation of masculine and feminine plurals
    يكتبون، يكتبوا، يكتبن → يكتبوا، يكتبون
  – Loss of some dual forms
    يكتبان، يكتبا → يكتبوا، يكتبون
• Other aspects increase in complexity!
Morphological Variation: Verb Morphology
[Figure labels: conj, verb, object, subj, tense, IOBJ, neg, neg]

MSA: ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him

EGY: وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not

"And you didn't write it for him"
Why Work on Arabic Dialects?
• Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
  – Speech recognition systems must model dialects
• Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
  – Text analytics of Arabic must include dialectal modeling
• Substantial Dialect-MSA differences impede direct application of MSA NLP tools
  – 36% of Egyptian words are not recognizable using MSA analyzers (Habash et al., 2012)
Comparing Performance
• Part-of-Speech Tagging and Syntax Parsing
  – Large gap between English and Arabic, and between Standard Arabic and Arabic dialects
  – More resources and more research efforts for English compared to Arabic

English   Standard Arabic   Egyptian Arabic
Full Part-of-Speech   97.6%   85.4%   75.5%
Core Part-of-Speech   96.1%   91.1%
Dependency Syntax   92.2%   86.2%

Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
Comparing Performance
• Machine Translation
  – Quality of machine translation from MSA is much better than from the dialects
  – The main reason is availability of parallel corpora
    • 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)

      Arabic Source Text              Google Translate (Nov 12, 2016)
MSA   لا يوجد كهرباء، ماذا حدث؟       No electricity, what happened?
EGY   الكهربا اتقطعت، ليه كده بس؟     Atqtat electricity, why like Bs?
LEV   شكلو مفيش كهربا، ليش هيك؟       Joined Mafeesh looks like it, Why the heck?
IRQ   شو ماكو كهرباء، خير؟            Shaw Maku electricity, okay?
Resources: Linguistic Data Consortium
• All Arabic resources compared to English resources went from 3.6% in 2000 to 35% in 2016
• Arabic dialect resources account for 21% of all Arabic resources
• These numbers do not cover all resources, but they are fairly representative.
[Chart: number of resources for Arabic Dialects, All Arabic, and English; vertical axis from 0 to 450.]
Publications: Google Scholar, Natural Language Processing
Publications on Arabic dialect, All Arabic and English
• On average, publications on Arabic are equal to 6% of publications on English
• Arabic dialect publications went from 21% of all Arabic publications in 2000 to 50% in 2016 (overall 37%)
[Chart: publications from 2000 to 2016 for Arabic Dialect NLP, Arabic NLP, and English NLP; vertical axis from 0 to 120,000. A second version of the chart adds a secondary axis from 0 to 8,000.]
Publications: Google Scholar, Natural Language Processing
Publications on a Number of Languages
• Many languages lag behind English
• The number of German native speakers is less than a third of the number of Arabic native speakers, but German has over twice the publication count of Arabic

Language          Publications since 2000
English           107,930
French            17,700
Chinese           17,500
German            16,800
Spanish           15,600
Arabic            7,019
Arabic Dialects   2,595

All numbers are counts of publications written in English.
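The percentages quoted on the publication slides can be checked directly against the table. A quick sketch using the counts above (the dictionary and variable names are illustrative):

```python
# Publication counts since 2000, as listed in the table.
PUBLICATIONS = {
    "English": 107_930,
    "French": 17_700,
    "Chinese": 17_500,
    "German": 16_800,
    "Spanish": 15_600,
    "Arabic": 7_019,
    "Arabic Dialects": 2_595,
}

# Arabic publications as a fraction of English publications.
arabic_vs_english = PUBLICATIONS["Arabic"] / PUBLICATIONS["English"]

# Share of Arabic publications that are about dialects.
dialect_share = PUBLICATIONS["Arabic Dialects"] / PUBLICATIONS["Arabic"]

# How many times more German publications there are than Arabic ones.
german_vs_arabic = PUBLICATIONS["German"] / PUBLICATIONS["Arabic"]

print(f"Arabic / English:  {arabic_vs_english:.1%}")
print(f"Dialects / Arabic: {dialect_share:.1%}")
print(f"German / Arabic:   {german_vs_arabic:.2f}x")
```

The totals give roughly 6.5% for Arabic relative to English, 37% for the dialect share, and about 2.4 times as many German publications as Arabic ones, in line with the figures stated on the slides.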
Computational Processing of Standard and Dialectal Arabic
• There has been a growing amount of work on Arabic processing
  – Multiple morphological analyzers, taggers and automatic annotation tools
    • BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, CALIMA, etc.
    • AIDA (dialect identification), 3arrib (Arabizi-to-Arabic), etc.
  – Multiple treebanks and parsers
    • Penn ATB, CATiB, Quran Corpus, ARZ-TB, Stanford parser, Camel parser, etc.
  – Large collections of monolingual text with or without annotations
    • Gigaword, news collections, QALB, YADAC, Curras, Gumar, etc.
  – Large collections of bilingual/multilingual text and lexicons
    • UN corpus, news collections, multi-dialect corpora, Tharwa, ArabAquis, etc.
  – Sentiment resources
    • ArSenL, SLSA, SAMAR, etc.
  – Not to mention the traditional resources on lexicography, morphology and syntax!
• In general, more is done on Standard Arabic than on the dialects.
Examples
• Some examples of ongoing projects on Arabic language processing
  – Conventional Orthography for Dialectal Arabic
  – MADAMIRA Arabic tagger
  – Gumar Project
  – SAMER Project
  – MADAR Project
CODA: A Conventional Orthography for Dialectal Arabic
• Developed for computational processing purposes (Habash et al., 2012)
• Objectives
  – CODA covers all Arabic dialects in principle
  – CODA minimizes differences in choices
  – CODA is easy to learn and produce consistently
  – CODA is intuitive to readers unfamiliar with it
  – CODA uses Arabic script
• Current manuals for Egyptian, Tunisian, Levantine, Algerian, and Gulf
CODA Examples
CODA:  ماشفتش صحابي الفترة اللي قبل الامتحانات
Gloss: I did not see | my friends | the period | which | before | the exams

Spelling variants, by word:
• ماشفتش (I did not see): ما شفتش، مشفتش، ماشوفتش، ما شوفتش، مشوفتش, mashoftish
• صحابي (my friends): صحابى، صوحابي، صوحابى, Su7abi, sohaby
• الفترة (the period): الفتره، الفطرة، الفطره, ilftra
• اللي (which): اللى، إللي، إللى، الي، الى، إلي، إلى, illi
• قبل (before): أبل، ابل, abl, qbl, qabl
• الامتحانات (the exams): الإمتحانات، المتحانات، الامتحنات، الإمتحنات، المتحنات, ilimti7anat, limtihanaat
MADAMIRA
• State-of-the-art Arabic and Arabic dialect processing tool (Pasha et al., 2014)
  – Collaborative effort
    • Columbia University (Rambow)
    • George Washington University (Diab)
    • New York University Abu Dhabi (Habash)
  – Morphological disambiguation
  – Tokenization
  – Base phrase chunking
  – Named entity recognition
• MSA and Egyptian Arabic modes
• Server mode with XML interface
• http://camel.abudhabi.nyu.edu/madamira/
[Diagram: Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications]
MADAMIRA: Morphological Disambiguation

System:             MSA     MSA     EGY
Test:               MSA     EGY     EGY
Full Analysis       84.3%   27.0%   75.4%
Diacritization      86.4%   32.2%   83.2%
Lemmatization       96.1%   67.1%   86.3%
Base POS-tagging    96.1%   82.1%   91.1%
ATB Segmentation    99.1%   90.5%   97.4%

Example: wkAtbh وكاتبه "and his writer"
wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms
w+ kAtb +h
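The analysis line for وكاتبه is a diacritized form, a lemma, and a list of feature:value pairs, which makes it easy to turn into a dictionary. The sketch below parses the line exactly as it appears on the slide; the key names "diac" and "lemma" and the function name are labels chosen here for illustration, not part of MADAMIRA's documented interface.

```python
# The analysis line as shown on the slide (split across two literals only for width).
LINE = (
    "wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na "
    "vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)


def parse_analysis(line):
    """Split one analysis line into its diacritized form, lemma and feature dictionary."""
    diac, lemma, *pairs = line.split()
    features = dict(pair.split(":", 1) for pair in pairs)
    return {"diac": diac, "lemma": lemma, "features": features}


analysis = parse_analysis(LINE)
print(analysis["diac"], analysis["lemma"])
print(analysis["features"]["pos"], analysis["features"]["prc2"], analysis["features"]["enc0"])
```

Read this way, the line says that the word is a noun (pos:noun) with a conjunction proclitic (prc2:wa_conj) and a third-person masculine singular pronoun enclitic (enc0:pron3ms), matching the tokenization w+ kAtb +h.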
The Gumar Corpus
• 100 million words of mostly Gulf Arabic conversational novels published anonymously online (روايات النت 'Internet novels') (Khalifa et al., 2016)
• NYUAD REF funded to annotate 200K words manually (Habash).
  – We are hiring annotators!
• Gumar Corpus Browser: http://camel.abudhabi.nyu.edu/gumar/
SAMER Project
• Simplification of Arabic Masterpieces for Extensive Reading
  – Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
  – NYUAD REF funding for two years (start Sep 2016)
• Objectives
  – Create a standard for the simplification of modern fiction in Arabic for school-age learners
  – Develop a tool for automating readability scale grading for Arabic
  – Simplify a number of Arabic fiction masterpieces
    • Public competition
MADAR Project
• Multi-Arabic Dialect Applications and Resources
• Funded by the Qatar National Research Fund
• Collaboration among CMUQ, NYUAD and Columbia
  – Nizar Habash, Houda Bouamor, Kemal Oflazer and Owen Rambow
• Modeling 25 Arabic city dialects
  – Lexical resources, parallel data, dialect identification, and dialect machine translation
• Looking for linguists!
First Workshop on Arabic Dialect Technologies
• Sponsored by the NYUAD Institute
• A research un-conference
• 30 leading researchers on Arabic computational linguistics
• Discuss the state of the field and plan its future
• http://wardat2016.arabic-nlp.net/
CAMeL Lab
• Computational Approaches to Modeling Language
• A new NLP/CL lab at NYU Abu Dhabi
  – Arabic core NLP (morphology, syntax)
  – Arabic dialect modeling
  – Machine translation
  – Information retrieval
• We are hiring!!
  – Research Scientists, Postdocs, Research Assistants
• [email protected]
• http://www.camel-lab.com
Summary
• Arabic poses many challenges to language technologies
  – Orthographic ambiguity
    • Under-specification and inconsistency
  – Morphological complexity
    • Rich and complex system of features
  – Enormous variety
    • Many dialects and sub-dialects, code switching
  – Annotated resource poverty
• There has been a lot of work on Arabic and Arabic dialect technologies.
  – But the current performance levels are not acceptable.
Future Directions
• The field of Arabic language technologies needs a lot of support to keep growing.
  – More researchers and developers
    • Computational linguistics university programs
  – More resource and knowledge sharing
    • Open source, non-commercial license models
    • More conferences for Arabic language technologies
    • Coordination of technological standards
  – More funding to support academic researchers and startups
    • Build more resources and more tools
    • Encourage collaborations across universities and among universities and companies
Future Directions
• A culture around Arabic that expects its language technology to be
  – High quality
    • Robust, human-quality
  – Seamless
    • Well integrated in other applications
  – Supportive
    • Language technology to help the blind or hearing impaired
    • Language technology for pedagogy
  – Personalizable
    • Understand many dialects, speak many dialects
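What to expect (linguistically) from a robot?
Terminator 2: John Connor and T-800

- Keep it under 65. We don't want to be pulled over.
- Affirmative.
- No, no, no. You got to listen to the way people talk. You don't say "affirmative" or some shit like that. You say "No problemo". If someone comes up to you with an attitude, you say "Eat me".

Arabic rendering shown on the slide:
- حافظ على سرعه، لا أريد مشاكل مع السلطات
- مفهوم
- لا لا. إسمع كيف يتكلم الناس. لا تقول مفهوم او أى شىء مثلها. قل لا توجد مشكله. لو تعدى عليك أحد قل إلتهمنى

Literal back-translation of the Arabic: "Keep to a speed, I do not want problems with the authorities." "Understood." "No, no. Listen to how people talk. Do not say 'understood' or anything like it. Say 'there is no problem'. If someone assaults you, say 'devour me'."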
Acknowledgements
• Columbia University (Rambow, Eskander, Alkholy, Salloum, Alfardy, Altantawy)
• The George Washington University (Diab, Hawwari, Badrashiny)
• Carnegie Mellon University Qatar (Bouamor, Zaghouani, Obied, Oflazer)
• Birzeit University (Jarrar, Rimawi)
• American University of Beirut (Hajj, Baly, Badaro)
• University of Bahrain (Abdulrahim)
• New York University Abu Dhabi (Al Khalil, Soulamani)
• And the NYUAD CAMeLeers (Shahrour, Khalifa, Taji, Hasan, Zalmout, Saddiki, Erdmann)
• http://nyuad.nyu.edu/en/
Thank You! Questions?