
Language Technologies for Arabic and its Dialects

NYUAD Institute Talk, November 15, 2016

Prof. Nizar Habash, New York University Abu Dhabi

nizar.habash@nyu.edu

NYUAD CAMeL Lab

تقنيات اللغة العربية ولهجاتها

محاضرة معهد جامعة نيويورك أبوظبي، ١٥-١١-٢٠١٦

د. نزار حبش، جامعة نيويورك أبوظبي

Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
• State-of-the-art Arabic Technology
• Summary and Future Directions

Language Technologies

• Also known as
  – Natural Language Processing
  – Computational Linguistics
  – Human Language Technology
• Language Technology is an interdisciplinary field
  – Computer science, linguistics, cognitive science, psychology, pedagogy, mathematics, etc.
• Language technologies were some of the earliest applications of computer science
  – Cryptography
  – Machine Translation

Language Technologies

• Applications
  – Information retrieval
  – Machine translation
  – Automatic speech recognition & speech synthesis
  – Sentiment and emotion analysis
  – Dialogue systems & chatting agents
  – Optical character recognition
  – Automatic summarization, etc.
• Enabling technologies
  – Tokenization
  – Part-of-speech tagging
  – Syntactic parsing
  – Lemmatization
  – Word sense disambiguation, etc.

Paradigms for Language Technologies

• Rule-based Approaches
  – Linguists write rules that are applied by the machines
• Machine Learning Approaches
  – Corpus-based, Statistical Approaches
  – Machines learn the "rules" from training data
• Machine learning approaches are dominant in the field

What do we need to help machines learn?

• Data, data and more data!
• Specifically, annotated data

Application | Annotated Data Example
Machine Translation | Parallel corpus in two languages: UN corpus with English, Arabic, Chinese, Spanish, Russian, French
Sentiment Analysis | A corpus of tweets with tags indicating: positive, negative, neutral
Speech Recognition | A corpus of audio files with their corresponding transcripts
Optical Character Recognition | A corpus of scanned book page images and their corresponding transcripts
Part-of-Speech | An English corpus with Part-of-Speech indicated for each word

Challenges for Machine Learning Language Technologies

• Size of training data
  – More is better!
• Domain and genre sensitivity
  – Systems trained on news do not do well on novels
• Quality of annotations
  – Why expect good performance if humans do not agree with each other on the task?
• Developing robust algorithms for machine learning is essential

Machine Learning vs. Human Learning

• Humans: predisposed for acquiring language
• Machines: not so!


Roadmap

• On Language Technologies
• Arabic from a Technical Perspective
  – Writing system
  – Word structure
  – Dialects
• State-of-the-art Arabic Technology
• Summary and Future Directions

Arabic Script

الخط العربي

• An Abjad (consonantal alphabet with diacritics)
• Written right-to-left
• Letters have contextual variants
• Used to write many languages besides Arabic: Persian, Kurdish, Urdu, Pashto, etc.

Unicode

• The international encoding standard
• Widely supported input and display
• Supports extended Arabic characters
• Multi-script representation

الحب = 0000011000100111 0000011001000100 0000011000101101 0000011000101000
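The bit string on the slide can be reproduced directly from the code points of the four letters. The following is a minimal sketch; the function name `to_bits` is made up for illustration, and the word is written with escapes so that the logical letter order (alef, lam, hah, beh) is explicit.

```python
def to_bits(text: str) -> str:
    """Return each character's Unicode code point as a 16-bit binary string."""
    return " ".join(format(ord(ch), "016b") for ch in text)

# The word from the slide, in logical (typing) order: alef, lam, hah, beh
word = "\u0627\u0644\u062D\u0628"

bits = to_bits(word)
code_points = [f"U+{ord(ch):04X}" for ch in word]

print(code_points)
print(bits)
```

Each 16-bit group corresponds to one letter, which is why the slide shows 64 bits for a four-letter word.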

Arabic Input/Output

• Letter-based keyboard
• Logical order input
  – First-to-Last
• Visual display protocols
  – Right-to-Left
  – Letter shaping

س ل ا م → سلام
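As a small illustration of logical-order storage, the sketch below joins the four letters in the order they are typed and inspects their Unicode properties with Python's standard `unicodedata` module. Right-to-left layout and letter shaping are left to the display layer and are not modeled here.

```python
import unicodedata

# Letters in logical (first-to-last) order: seen, lam, alef, meem
letters = ["\u0633", "\u0644", "\u0627", "\u0645"]

# The stored string is simply the letters in typing order
word = "".join(letters)

# Character names and bidirectional classes from the Unicode database
names = [unicodedata.name(ch) for ch in letters]
bidi_classes = {unicodedata.bidirectional(ch) for ch in letters}

print(word)
print(names)
print(bidi_classes)  # Arabic letters carry the class 'AL' (Arabic Letter)
```

The bidirectional class is what a renderer uses to decide that the run should be laid out right-to-left, even though the string itself is stored first-to-last.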

Display Problems

[Figure: the headline تدشين منطقة حرة في دبي للتجارة الالكترونية ("Launch of a free zone in Dubai for electronic commerce") shown under every pairing of actual encoding (Unicode, ISO-8859, CP-1256) and display encoding (Western, Unicode, ISO-8859, CP-1256). The text is readable only where the display encoding matches the actual encoding; every other pairing shows garbled characters.]
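The mismatch behind these display problems can be reproduced by encoding text one way and decoding it another. The sketch below uses Python's built-in codecs (`cp1256`, `utf-8`, `latin-1`); the helper name `show_as` is an illustrative assumption, and `latin-1` stands in for the slide's "Western" display encoding.

```python
headline = "تدشين منطقة حرة في دبي للتجارة الالكترونية"

def show_as(text: str, actual: str, display: str) -> str:
    """Encode with the actual encoding, then decode the bytes the way the display assumes."""
    return text.encode(actual).decode(display, errors="replace")

# Matching encodings: the text survives the round trip
correct = show_as(headline, "cp1256", "cp1256")

# CP-1256 bytes shown by a Western (Latin-1) display: accented Latin letters
western = show_as(headline, "cp1256", "latin-1")

# UTF-8 bytes shown by a CP-1256 display: two wrong characters per Arabic letter
utf8_as_cp1256 = show_as(headline, "utf-8", "cp1256")

print(correct)
print(western)
print(utf8_as_cp1256)
```

Because CP-1256 is a single-byte encoding, the Western rendering has the same number of characters as the original; the UTF-8 case is longer because each Arabic letter occupies two bytes.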

Arabic Script

• Arabic script uses a set of optional diacritics
  – Only 1.5% of words have at least one diacritic
  – Combinable
    • كَتَّب /kattab/ 'to dictate'

Vowel: بَ /ba/, بُ /bu/, بِ /bi/, بْ /b/ (no vowel)
Nunation: بً /ban/, بٌ /bun/, بٍ /bin/
Gemination: بّ /bb/
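Since the diacritics are separate combining characters, removing or detecting them is a simple character-class operation. This is a minimal sketch, assuming the common diacritics are those in the range U+064B to U+0652 (fathatan through sukun); the function names are made up for illustration.

```python
import re

# Fathatan .. sukun (U+064B-U+0652): tanween, short vowels, shadda, sukun
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove the optional diacritics, leaving the bare letters."""
    return DIACRITICS.sub("", text)

def has_diacritic(word: str) -> bool:
    """True if the word carries at least one diacritic."""
    return DIACRITICS.search(word) is not None

# /kattab/: kaf + fatha, teh + shadda + fatha, beh
kattab = "\u0643\u064E\u062A\u0651\u064E\u0628"
bare = strip_diacritics(kattab)

print(kattab, "->", bare)
```

The stripped form is the three-letter string that a reader would normally see in print, which is where the ambiguity discussed next comes from.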

اسبانيا تنفي تجميد المساعدة الممنوحة للمغرب

مدريد 1-11 (اف ب) - اكد رئيس الحكومة الاسبانية خوسيه ماريا اثنار اليوم الخميس ان اسبانيا لم توقف المساعدة التي تقدمها للمغرب خلافا لما اكده امس الاربعاء وزير الشؤون الخارجية والتعاون المغربي محمد بن عيسى امام مجلس النواب المغربي. وقال رئيس الحكومة الاسبانية في مؤتمر صحافي ان التعاون بين اسبانيا والمغرب لم يتوقف ابدا ولم يجمد.

(English: "Spain denies freezing the aid granted to Morocco. Madrid, 1-11 (AFP) - Spanish Prime Minister José María Aznar affirmed today, Thursday, that Spain has not stopped the aid it provides to Morocco, contrary to what Moroccan Minister of Foreign Affairs and Cooperation Mohamed Benaissa asserted yesterday, Wednesday, before the Moroccan parliament. The Spanish Prime Minister said at a press conference that cooperation between Spain and Morocco has never stopped and has not been frozen.")

[The slide shows the same passage a second time with the few diacritics that occur in it marked.]

Orthographic Ambiguity

• Arabic words can be very ambiguous due to optional diacritics
• But how ambiguous?
• Classic example
  – ths s wht n rbc txt lks lk wth n vwls
  – this is what an Arabic text looks like with no vowels
  – Not exactly true
    • Long vowels are always written
    • Initial vowels are represented by an ا 'Alif'
    • Some final short vowels are deterministically inferable
  – ths is wht an Arbc txt lks lik wth no vwls
• For a computer…
  – A word on average has 12.3 analyses, 6.8 diacritizations, and 2.7 lemmas (core meanings)
    • Not all of this ambiguity is due to orthography! More on this later.
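The English analogy can be reproduced with a few lines of code. The sketch below is only an approximation of the slide's point: `drop_all_vowels` gives the "classic" version, and `keep_initial_vowels` models just one of the three corrections (word-initial vowels). It does not model long vowels, so it will not produce "lik" or "no" as the slide's refined example does.

```python
VOWELS = "aeiou"

def drop_all_vowels(sentence: str) -> str:
    """The 'classic example': remove every vowel letter from every word."""
    return " ".join(
        "".join(c for c in word if c.lower() not in VOWELS)
        for word in sentence.split()
    )

def keep_initial_vowels(sentence: str) -> str:
    """Closer to Arabic spelling: keep a word-initial vowel, drop the rest."""
    return " ".join(
        word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)
        for word in sentence.split()
    )

text = "this is what an Arabic text looks like with no vowels"
classic = drop_all_vowels(text)
refined = keep_initial_vowels(text)

print(classic)
print(refined)
```

Even this partial refinement makes the text noticeably easier to read, which is the slide's point: Arabic spelling is under-specified, but less so than the classic example suggests.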

Spelling Errors

• The Qatar Arabic Language Bank (QALB, PI Habash) project found that a very high percentage (30%) of words have errors in unedited Standard Arabic comments on Aljazeera.
• Arabic spelling errors are a big challenge to language technologies
  – GIGO: Garbage In Garbage Out
  – Errors in Standard Arabic
  – Inconsistencies in Dialectal Arabic (no official standard)
• Robust systems need additional functionality to allow for correcting errors or functioning well despite them.


Morphological Complexity

• Arabic is morphologically rich
  – A core word has many inflected forms
  – Example: Arabic verbs have 5,400 forms
    • Gender (2), Number (3), Person (3), Aspect (3), Tense particle (2), Mood (3), Voice (2), Pronominal clitic (12), Conjunction clitic (3)

وسنقولها /wasanaqūluhā/
و+س+ن+قول+ها
wa+sa+na+qūl+u+hā
and+will+we+say+it
'And we will say it'

قال، قالت، قالا، قالوا، قلت، قلت، قلتما، قلتم، قلتن، يقول، يقول، يقل، تقول، تقول، تقل، تقولين، تقولي، ... فقال، فقالت، فقالا، ... وسنقولها، وسأقولها، ...
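To give a feel for how clitics multiply the number of surface forms, here is a deliberately simplified sketch. It is not a real morphological generator: the slot inventories below are small illustrative assumptions, the stem is held fixed, and short vowels, mood, and stem changes (such as قل) are ignored.

```python
from itertools import product

# Toy slot inventories (illustrative assumptions, far smaller than real Arabic)
CONJ = ["", "و", "ف"]          # no conjunction, wa- 'and', fa- 'so'
FUTURE = ["", "س"]             # optional future particle sa-
SUBJ = ["أ", "ن", "ت", "ي"]    # imperfective subject prefixes
STEM = "قول"                   # 'say'
OBJ = ["", "ها", "ه"]          # no object, 'it/her', 'it/him'

# Concatenate one choice per slot, in the order the slide segments the word
forms = {c + f + s + STEM + o for c, f, s, o in product(CONJ, FUTURE, SUBJ, OBJ)}

print(len(forms))
print("وسنقولها" in forms)
```

Four small slots around a single stem already give 72 distinct strings, including the slide's example; the full feature set listed above is what drives the count into the thousands.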

Morphological Complexity

• English is not morphologically rich.
  – The number of inflected forms is small
  – The verb paradigm is limited to 6 forms

    VB | VBD | VBG | VBN | VBP | VBZ
    go | went | going | gone | go | goes

  – The complete English part-of-speech tagset has 48 tags
  – The complete Arabic part-of-speech tagset has 22,400 tags

Morphological Ambiguity

• 12.3 analyses and 2.7 lemmas per word
• Spelling ambiguity
  – Optional diacritics
  – Suboptimal spelling, e.g., (أ, إ → ا) or (ة → ه)
  – Example: وبادلتها
    • و+ب+أدلة+ها 'and with her pieces of evidence'
    • و+بادلت+ها 'and I exchanged with her'
• Derivational ambiguity and homonymy
  – العين: the eye, the water spring, Al-Ain city, the notable
  – المحتل: occupier, occupied (العدو المحتل / الوطن المحتل / الدول المحتلة)
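The two spelling substitutions named above are often applied on purpose as a normalization step, so that variant spellings match each other. The sketch below shows that step and its cost; the mapping covers only the substitutions listed on the slide, and the function name is made up for illustration.

```python
# Map hamzated alef forms to bare alef, and teh marbuta to heh
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0629": "\u0647",  # teh marbuta -> heh
})

def normalize(word: str) -> str:
    """Collapse the spelling variants so that inconsistent spellings compare equal."""
    return word.translate(NORMALIZE)

evidence = "\u0623\u062F\u0644\u0629"   # 'pieces of evidence', standard spelling
anna = "\u0623\u0646"                   # alef-hamza-above + noon
inna = "\u0625\u0646"                   # alef-hamza-below + noon

print(normalize(evidence))
print(normalize(anna) == normalize(inna))
```

Normalization makes matching robust to sloppy spelling, but as the last line shows it also merges words that were distinct, which adds to the ambiguity a disambiguator has to resolve.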

Analysis vs. Disambiguation

هل سينجح بين أفليك في دور باتمان؟
Will Ben Affleck be a good Batman?

Analyses of بين:

PV+PVSUFF_SUBJ:3MS | He demonstrated
PV+PVSUFF_SUBJ:3FP | They demonstrated (f.p)
NOUN_PROP | Ben
ADJ | Clear
PREP | Between, among
PREP+NOUN_PROP | In Yen

• Morphological Analysis is out-of-context
• Morphological Disambiguation is in-context
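The distinction can be sketched as two functions: analysis is a context-free lookup, disambiguation picks one analysis using the neighboring words. This is a toy illustration only. The analysis table is copied from the slide, while the surname list and the single context rule are hypothetical stand-ins for the statistical models that real systems use.

```python
# Out-of-context analyses of the word, as listed on the slide
ANALYSES = {
    "بين": [
        ("PV+PVSUFF_SUBJ:3MS", "He demonstrated"),
        ("PV+PVSUFF_SUBJ:3FP", "They demonstrated (f.p)"),
        ("NOUN_PROP", "Ben"),
        ("ADJ", "Clear"),
        ("PREP", "Between, among"),
        ("PREP+NOUN_PROP", "In Yen"),
    ]
}

# Hypothetical gazetteer used by the toy context rule below
KNOWN_SURNAMES = {"أفليك"}

def analyze(word):
    """Morphological analysis: every possible reading, ignoring context."""
    return ANALYSES.get(word, [])

def disambiguate(words, i):
    """Toy disambiguation: prefer the proper-noun reading before a known surname."""
    candidates = analyze(words[i])
    following = words[i + 1] if i + 1 < len(words) else None
    if following in KNOWN_SURNAMES:
        for tag, gloss in candidates:
            if tag == "NOUN_PROP":
                return tag, gloss
    return candidates[0] if candidates else None

sentence = "هل سينجح بين أفليك في دور باتمان؟".split()
choice = disambiguate(sentence, 2)

print(len(analyze("بين")), "analyses out of context")
print("chosen in context:", choice)
```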


Arabic and its Dialects

• Arabic has ~360M speakers
• Forms of Arabic
  – Classical Arabic (CA)
    • Classical historical texts
    • Liturgical texts
  – Modern Standard Arabic (MSA)
    • News media & formal speeches and settings
    • Only written standard
  – Dialectal Arabic (DA)
    • Predominantly spoken vernaculars
    • No written standards
• Diglossia
  – Two forms of the language exist side by side

Arabic and its Dialects

• Official language: Modern Standard Arabic (MSA)
  ⇒ No one's native language
• Regional Dialects
  – Egyptian Arabic (EGY)
  – Levantine Arabic (LEV)
  – Gulf Arabic (GLF)
  – North African Arabic (NOR): Moroccan, Algerian, Tunisian
  – Iraqi, Yemenite, Sudanese
• Dialects and sub-dialects…
  – City, Rural, Bedouin

Dialects or Languages?

• The arguments from power
  – "A language is a dialect with an army".
  – Religion, nationalism, regionalism, identity
• The arguments from linguistic difference
  – Degrees of mutual intelligibility
  – "The eager communicator" and "the eavesdrop test"
• The view from language technology
  – This question is irrelevant.
  – How do we model Arabic as a diglossic system?
    • Variations and their function
  – How do we model how Arabs communicate?
    • Behavior and expectation
  – Can we exploit similarities among dialects and between MSA and dialects to build better systems?
    • Technology and power

Phonological Variations

• Major variants

Letter | MSA | Dialects
ق | /q/ | /q/, /k/, /ʔ/, /g/, /ʤ/, /ɢ/
ث | /θ/ | /θ/, /t/, /s/
ذ | /ð/ | /ð/, /d/, /z/
ج | /ʤ/ | /ʤ/, /g/, /ʒ/

Spelling Inconsistency

Egyptian Arabic word مابيقولهاش /mabiʔulhāʃ/ "he does not say it"

مابيقولهاش، مبيقولهاش، مابقولهاش، مبقولهاش، مابيقلهاش، مبيقلهاش، مابقلهاش، مبقلهاش، مابيئولهاش، مبيئولهاش، مبيئلهاش، مابئلهاش، مبئلهاش، مابيؤلهاش، مبيؤلهاش، مابؤلهاش، مبؤلهاش، مابئولهاش، مبئولهاش، مابيئلهاش

Mabe'ulhash, Mabi'ulhash, Mabequlhash, Mabiqulhash, Mabeulhash, Mabiulhash, Mabe'ulhach, Mabi'ulhach, Mabequlhach, Mabiqulhach, …

If there is no standard, can a word be misspelled?
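The variants on the slide arise from a handful of independent spelling choices, so their number grows multiplicatively. The sketch below enumerates one such set of choices; the four slots are an illustrative reading of the listed spellings rather than a description of any tool, and the enumeration includes a few combinations that do not appear on the slide.

```python
from itertools import product

# Independent spelling choices for the same spoken word
NEG = ["ما", "م"]        # negation prefix written long or short
ASPECT = ["بي", "ب"]     # progressive prefix with or without the yeh
QAF = ["ق", "ئ", "ؤ"]    # etymological qaf, or a hamza form reflecting pronunciation
VOWEL = ["و", ""]        # long vowel written or omitted
REST = "لهاش"            # remainder of the word, held fixed here

variants = [n + a + q + v + REST for n, a, q, v in product(NEG, ASPECT, QAF, VOWEL)]

print(len(variants), "spellings of one word")
for v in variants[:4]:
    print(v)
```

Four small choices already give 24 spellings, before Latin-script (Arabizi) variants are even considered, which is the motivation for a conventional orthography.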

Lexical Variation

English | Table | Cat | Of | I want | He will write | There isn't
MSA | Tāwila طاولة | qiTTa قطة | idafa Ø | uridu اريد | sayaktubu سيكتب | lā yujadu لا يوجد
Moroccan | mida ميدة | qeTTa قطة | dyāl ديال | bγīt بغيت | γajekteb غيكتب | mākāynš ماكاينش
Egyptian | Tarabēza طربيزة | oTTa قطة | bitāς بتاع | ςāwez عاوز | hayiktib هيكتب | mafiš مفيش
Syrian | Tāwle طاولة | bisse بسة | tabaς تبع | biddi بدي | Hayoktob حيكتب | māfi مافي
Iraqi | mēz ميز | bazzūna بزونة | māl مال | arīd اريد | raH yiktib رح يكتب | māku ماكو

Lexical Variation

• براد EGY: kettle - LEV: fridge
• مرا EGY: prostitute - LEV: woman
• ماشي EGY/LEV: okay - MOR: not
• بسط EGY/LEV: make happy - IRQ: beat up
• بلش LEV: start - SUD: end

Morphological Variation

• Some aspects of words are simplified in the dialects
  – Loss of case marking
    كتابٌ، كتابًا، كتابٍ، كتابُ، كتابَ، كتابِ → كتاب
  – Consolidation of masculine and feminine plurals
    يكتبوا، يكتبون، يكتبن → يكتبوا، يكتبون
  – Loss of some dual forms
    يكتبان، يكتبا → يكتبوا، يكتبون
• Other aspects increase in complexity!

Morphological Variation: Verb Morphology

Parts labeled on the slide: conj, verb, object, subj, tense, IOBJ, neg, neg

MSA: ولم تكتبوها له
/walam taktubūhā lahu/
/wa+lam taktubū+hā la+hu/
and+not_past write_you+it for+him

EGY: وماكتبتوهالوش
/wimakatabtuhalūʃ/
/wi+ma+katab+tu+ha+lū+ʃ/
and+not+wrote+you+it+for_him+not

'And you didn't write it for him'

Why Work on Arabic Dialects?

• Dialects are the primary form of Arabic used in all unscripted spoken genres: conversational, talk shows, interviews, etc.
  – Speech recognition systems must model dialects
• Dialects are increasingly in use in new written media (newsgroups, weblogs, forums, etc.)
  – Text analytics of Arabic must include dialectal modeling
• Substantial Dialect-MSA differences impede direct application of MSA NLP tools
  – 36% of Egyptian words are not recognizable using MSA analyzers (Habash et al., 2012)


Comparing Performance

• Part-of-Speech Tagging and Syntax Parsing
  – Results from (Björkelund et al., 2013; Pasha et al., 2014; Weiss et al., 2015; Kumar et al., 2016)
  – Large gap between English and Arabic, and between Standard Arabic and Arabic dialects
  – More resources and more research efforts for English compared to Arabic.

Accuracy (columns on the slide: English, Standard Arabic, Egyptian Arabic):
– Full Part-of-Speech: 97.6%, 85.4%, 75.5%
– Core POS Part-of-Speech: 96.1%, 91.1% (two columns only)
– Dependency Syntax: 92.2%, 86.2% (two columns only)

Comparing Performance

• Machine Translation
  – Quality of machine translation from MSA is much better than in the dialects
  – The main reason is availability of parallel corpora
    • 150 million words of parallel Standard Arabic-English text compared to 1.5 million words of Dialect-English text (Zbib et al., 2012)

Variety | Arabic Source Text | Google Translate (Nov 12, 2016)
MSA | لا يوجد كهرباء، ماذا حدث؟ | No electricity, what happened?
EGY | الكهربا اتقطعت، ليه كده بس؟ | Atqtat electricity, why like Bs?
LEV | شكلو مفيش كهربا، ليش هيك؟ | Joined Mafeesh looks like it, Why the heck?
IRQ | شو ماكو كهرباء، خير؟ | Shaw Maku electricity, okay?

Resources: Linguistic Data Consortium

• All Arabic resources compared to English resources went from 3.6% in 2000 to 35% in 2016
• Arabic dialect resources account for 21% of all Arabic resources
• These numbers do not cover all resources, but they are fairly representative.

[Chart: number of resources for Arabic Dialects, All Arabic, and English; axis from 0 to 450]

Publications: Google Scholar, Natural Language Processing

Publications on Arabic dialect, All Arabic and English

[Chart: yearly publication counts, 2000 to 2016, for Arabic Dialect NLP, Arabic NLP, and English NLP. The slide appears twice, the second time with a separate 0 to 8,000 axis alongside the 0 to 120,000 axis.]

• On average, publications on Arabic are equal to 6% of publications on English
• Arabic dialect publications went from 21% of all Arabic publications in 2000 to 50% in 2016 (overall 37%)

Publications: Google Scholar, Natural Language Processing

Publications on a Number of Languages

• Many languages lag behind English
• The number of German native speakers is less than a third of the number of Arabic native speakers, but German has over twice the publication count of Arabic

Language | Publications since 2000
English | 107,930
French | 17,700
Chinese | 17,500
German | 16,800
Spanish | 15,600
Arabic | 7,019
Arabic Dialects | 2,595

All numbers are over publications in English
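The ratios quoted on these slides can be checked against the totals in the table. The sketch below uses only the counts listed above; note that the table gives totals since 2000, so the Arabic-to-English ratio computed here (about 6.5%) is close to, but not the same quantity as, the slide's yearly average of 6%.

```python
# Publication totals since 2000, as listed in the table
pubs = {
    "English": 107_930,
    "French": 17_700,
    "Chinese": 17_500,
    "German": 16_800,
    "Spanish": 15_600,
    "Arabic": 7_019,
    "Arabic Dialects": 2_595,
}

arabic_vs_english = pubs["Arabic"] / pubs["English"]
dialect_share = pubs["Arabic Dialects"] / pubs["Arabic"]
german_vs_arabic = pubs["German"] / pubs["Arabic"]

print(f"Arabic / English:         {arabic_vs_english:.1%}")
print(f"Arabic dialects / Arabic: {dialect_share:.1%}")
print(f"German / Arabic:          {german_vs_arabic:.2f}x")
```

The dialect share works out to about 37%, matching the "overall 37%" figure, and German comes out at roughly 2.4 times the Arabic count, consistent with "over twice".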

Computational Processing of Standard and Dialectal Arabic

• There has been a growing amount of work on Arabic processing
  – Multiple morphological analyzers, taggers and automatic annotation tools
    • BAMA/SAMA, Elixir, AlKhalil, ALMOR, MADAMIRA, CALIMA, etc.
    • AIDA (dialect identification), 3arrib (Arabizi-to-Arabic), etc.
  – Multiple treebanks and parsers
    • Penn ATB, CATiB, Quran Corpus, ARZ-TB, Stanford parser, Camel parser, etc.
  – Large collections of monolingual text with or without annotations
    • Gigaword, news collections, QALB, YADAC, Curras, Gumar, etc.
  – Large collections of bilingual/multilingual text and lexicons
    • UN corpus, news collections, multi-dialect corpora, Tharwa, ArabAquis, etc.
  – Sentiment resources
    • ArSenL, SLSA, SAMAR, etc.
  – Not to mention the traditional resources on lexicography, morphology and syntax!
• In general, more is done on Standard Arabic than the dialects.

Examples

• Some examples of ongoing projects on Arabic language processing
  – Conventional Orthography for Dialectal Arabic
  – MADAMIRA Arabic tagger
  – Gumar Project
  – SAMER Project
  – MADAR Project

CODA: A Conventional Orthography for Dialectal Arabic

• Developed for computational processing purposes (Habash et al., 2012)
• Objectives
  – CODA covers all Arabic dialects in principle
  – CODA minimizes differences in choices
  – CODA is easy to learn and produce consistently
  – CODA is intuitive to readers unfamiliar with it
  – CODA uses Arabic script
• Current manuals for Egyptian, Tunisian, Levantine, Algerian, and Gulf

CODA Examples

CODA: ماشفتش صحابي الفترة اللي قبل الامتحانات
Gloss: I did not see | my friends | the period | which | before | the exams

Spelling variants, by word:
ماشفتش: ما شفتش، مشفتش، ماشوفتش، ما شوفتش، مشوفتش، mashoftish
صحابي: صحابى، صوحابي، صوحابى، Su7abi, sohaby
الفترة: الفتره، الفطرة، الفطره، ilftra
اللي: اللى، إللي، إللى، الي، الى، إلي، إلى، illi
قبل: أبل، ابل، abl, qbl, qabl
الامتحانات: الإمتحانات، المتحانات، الامتحنات، الإمتحنات، المتحنات، ilimti7anat, limtihanaat

MADAMIRA

• State-of-the-art Arabic and Arabic Dialect Processing tool (Pasha et al., 2014)
  – Collaborative effort
    • Columbia University (Rambow)
    • George Washington University (Diab)
    • New York University Abu Dhabi (Habash)
  – Morphological disambiguation
  – Tokenization
  – Base phrase chunking
  – Named entity recognition
• MSA and Egyptian Arabic modes
• Server-mode with XML interface

Pipeline: Input Arabic Text → Morphological Disambiguation → Tokenization → Base Phrase Chunking → Named Entity Recognition → User NLP Applications

MADAMIRA: http://camel.abudhabi.nyu.edu/madamira/

MADAMIRA Morphological Disambiguation

System: | MSA | MSA | EGY
Test: | MSA | EGY | EGY
Full Analysis | 84.3% | 27.0% | 75.4%
Diacritization | 86.4% | 32.2% | 83.2%
Lemmatization | 96.1% | 67.1% | 86.3%
Base POS-tagging | 96.1% | 82.1% | 91.1%
ATB Segmentation | 99.1% | 90.5% | 97.4%

Example: wkAtbh وكاتبه 'and his writer'

wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms

w+ kAtb +h
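The analysis line above is a flat sequence of `feature:value` pairs and is easy to turn into a dictionary. The sketch below parses exactly the string shown on the slide. Treating the first field as the diacritized form and the second as the lemma is an assumption based on how the example reads, not a statement about MADAMIRA's documented output format, and `parse_analysis` is a made-up helper name.

```python
# The analysis line exactly as shown on the slide
line = (
    "wakAtibuhu kAtib_1 pos:noun prc3:0 prc2:wa_conj prc1:0 prc0:0 "
    "per:3 asp:na vox:na mod:na gen:m num:s stt:c cas:n enc0:pron3ms"
)

def parse_analysis(text: str) -> dict:
    """Split one analysis line into (assumed) diacritized form, lemma, and features."""
    diac, lemma, *pairs = text.split()
    feats = dict(pair.split(":", 1) for pair in pairs)
    return {"diac": diac, "lemma": lemma, "feats": feats}

analysis = parse_analysis(line)

print(analysis["diac"], analysis["lemma"])
print(analysis["feats"]["pos"], analysis["feats"]["prc2"], analysis["feats"]["enc0"])
```

Read this way, the features spell out the tokenization shown beneath the example: a conjunction proclitic (`prc2:wa_conj`), a noun, and a third-person masculine singular pronoun enclitic (`enc0:pron3ms`).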

The Gumar Corpus

• 100 million words of mostly Gulf Arabic conversational novels published anonymously online (روايات النت 'Internet novels') (Khalifa et al., 2016)
• NYUAD REF funded to annotate 200K words manually (Habash).
  – We are hiring annotators!
• Gumar Corpus Browser: http://camel.abudhabi.nyu.edu/gumar/

SAMER Project

• Simplification of Arabic Masterpieces for Extensive Reading
  – Muhamed Al Khalil, Nizar Habash and Dris Sulaimani
  – NYUAD REF funding for two years (start Sep 2016)
• Objectives
  – Create a standard for the simplification of modern fiction in Arabic for school-age learners.
  – Develop a tool for automating readability scale grading for Arabic
  – Simplify a number of Arabic fiction masterpieces
    • Public competition

MADAR Project

• Multi-Arabic Dialect Applications and Resources
• Funded by the Qatar National Research Fund
• Collaboration among CMUQ, NYUAD and Columbia
  – Nizar Habash, Houda Bouamor, Kemal Oflazer and Owen Rambow
• Modeling 25 Arabic city dialects
  – Lexical resources, parallel data, dialect identification, and dialect machine translation
• Looking for linguists!

First Workshop on Arabic Dialect Technologies

• Sponsored by the NYUAD Institute
• A research un-conference
• 30 leading researchers on Arabic computational linguistics
• Discuss the state of the field and plan its future
• http://wardat2016.arabic-nlp.net/

CAMeL Lab

• Computational Approaches to Modeling Language
• A new NLP/CL lab at NYU Abu Dhabi
  – Arabic core NLP (morphology, syntax)
  – Arabic dialect modeling
  – Machine translation
  – Information retrieval
• We are hiring!!
  – Research Scientists, Postdocs, Research Assistants
• Contact nizar.habash@nyu.edu
• http://www.camel-lab.com


Summary

• Arabic poses many challenges to language technologies
  – Orthographic ambiguity
    • Under-specification and inconsistency
  – Morphological complexity
    • Rich and complex system of features
  – Enormous variety
    • Many dialects and sub-dialects, code switching
  – Annotated resource poverty
• There has been a lot of work on Arabic and Arabic dialect technologies.
  – But the current performance levels are not acceptable.

Future Directions

• The field of Arabic language technologies needs a lot of support to keep growing.
  – More researchers and developers
    • Computational linguistics university programs
  – More resource and knowledge sharing
    • Open source, non-commercial license models
    • More conferences for Arabic language technologies
    • Coordination of technological standards
  – More funding to support academic researchers and startups
    • Build more resources and more tools
    • Encourage collaborations across universities and among universities and companies

Future Directions

• A culture around Arabic that expects its language technology to be
  – High quality
    • Robust, human-quality
  – Seamless
    • Well integrated in other applications
  – Supportive
    • Language technology to help the blind or hearing impaired
    • Language technology for pedagogy
  – Personalizable
    • Understand many dialects, speak many dialects

What to expect (linguistically) from a robot?

Terminator 2: John Connor and T-800

- Keep it under 65. We don't want to be pulled over.
- Affirmative.
- No, no, no. You got to listen to the way people talk. You don't say "affirmative" or some shit like that. You say "No problemo". If someone comes up to you with an attitude, you say "Eat me".

Arabic rendering on the slide:

- حافظ على سرعه لا أريد مشاكل مع السلطات
- مفهوم
- لا لا
إسمع كيف يتكلم الناس
لا تقول مفهوم او أى شىء مثلها
قل لا توجد مشكله
لو تعدى عليك أحد قل إلتهمنى

Acknowledgements

• Columbia University (Rambow, Eskander, Alkholy, Salloum, Alfardy, Altantawy)
• The George Washington University (Diab, Hawwari, Badrashiny)
• Carnegie Mellon University Qatar (Bouamor, Zaghouani, Obied, Oflazer)
• Birzeit University (Jarrar, Rimawi)
• American University of Beirut (Hajj, Baly, Badaro)
• University of Bahrain (Abdulrahim)
• New York University Abu Dhabi (AlKhalil, Soulamani)
• And the NYUAD CAMeLeers (Shahrour, Khalifa, Taji, Hasan, Zalmout, Saddiki, Erdmann)
• http://nyuad.nyu.edu/en/

Thank You! Questions?
