NLP TOOLS DEVELOPMENT FOR TAMIL LANGUAGE Loganathan, UFAL


Overview

Introduction
My work involving Tamil NLP
  English – Tamil MT
  Tamil Morphological Analyzer

Introduction – Indian Languages

23 official languages
29 languages have more than 1 million native speakers in India.
Tamil: approx. 67 million speakers in India

Introduction – Resources for Tamil

Tamil Editing/Unicode Support – Available
Dictionary – Available
  Tamil lexicon, Winslow, Fabricius, McAlpin, Kathirvelu pillai – published online by Univ. of Chicago
Morphological Analyzer/Tagger – Partially Available
Phrase Structure/Dependency Parser – NO
Parallel Corpora – NO (publicly, readily)
  Active bilingual websites: www.wsws.org, www.cinesouth.com
  Tamilnadu government schoolbooks (with English translations)

My Work involving Tamil NLP

English – Tamil Translation System (Master’s Thesis)
Morphological Analyzer

English – Tamil Translation System

General Differences

Morphological

Noun case      Tamil suffixes          English words
Nominative     Ø                       Ø
Accusative     ai                      Ø
Dative         kku, ukku, ku           Ø, to, for, at
Benefactive    (u)kkuAka               for
Instrumental   Al                      with, of, by
Sociative      Otu, utan               with
Locative       il, itam                in, on, among, to, with, from
Ablative       iliruwTu, itamiruwTu    from
Genitive       in, utaiya, aTu         's, of
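
The case table can be held as a small lookup structure. The sketch below is only an illustration: the names CASES and cases_for are hypothetical and not part of the presented system. Inverting the table shows that one English preposition can correspond to several Tamil cases, which is one reason suffix selection is hard in English-Tamil MT.

```python
# Case table from the slide: case -> (Tamil suffixes, English words).
# "Ø" marks a case with no overt marker. All names here are illustrative.
CASES = {
    "Nominative":   (["Ø"], ["Ø"]),
    "Accusative":   (["ai"], ["Ø"]),
    "Dative":       (["kku", "ukku", "ku"], ["Ø", "to", "for", "at"]),
    "Benefactive":  (["(u)kkuAka"], ["for"]),
    "Instrumental": (["Al"], ["with", "of", "by"]),
    "Sociative":    (["Otu", "utan"], ["with"]),
    "Locative":     (["il", "itam"], ["in", "on", "among", "to", "with", "from"]),
    "Ablative":     (["iliruwTu", "itamiruwTu"], ["from"]),
    "Genitive":     (["in", "utaiya", "aTu"], ["'s", "of"]),
}


def cases_for(english_word):
    """Return every case whose English column lists the given word."""
    return [case for case, (_, words) in CASES.items() if english_word in words]


for word in ("with", "from", "for"):
    print(word, "->", cases_for(word))
```

Each of the three prepositions queried here maps to more than one case, so a translation system needs context to pick the right suffix.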

General Differences

Syntactical difference

English: Kumar talked about linguistics
Tamil (transliterated): kumAr moziyiyalaip paRRip pEcinAn
Tamil: குமார் மொழியியலைப் பற்றிப் பேசினான்
Alternative word order: moziyiyalaip paRRip kumAr pEcinAn

General Differences

Syntactical (complex sentences)

English (MC SC RC):
  MC: Pollution Control Authority’s regional officer said
  SC: that his department is not agreeing with the central minister’s opinion
  RC: that Pollution Control Authority is not functioning

Tamil clauses, transliterated (listed in the English clause order):
  MC: mAcuk kattuppAttu vAriya aTikAri TeriviTTAr
  SC: maTTiya amaiccarin karuTTil TangkaLaTu TuRaikku utanpAtu illai ena
  RC: mAcuk kattuppAttu vAriyam ceyalpaTavillai enRa

Tamil (RC SC MC):
  RC: மாசுக் கட்டுப்பாட்டு வாரியம் செயல்படவில்லை என்ற
  SC: மத்திய அமைச்சரின் கருத்தில் தங்களது துறைக்கு உடன்பாடு இல்லை என
  MC: மாசுக் கட்டுப்பாட்டு வாரிய அதிகாரி தெரிவித்தார்

Clause order: MC SC RC -> RC SC MC
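
The clause-order change can be illustrated with the slide's own clauses. This is a minimal sketch, assuming whole clauses are moved as units; the function name to_tamil_order is hypothetical, and a real reordering module would also have to reorder words inside each clause.

```python
# Clauses of the slide's example in English order (MC, SC, RC),
# paired with the Tamil transliterations given on the slide.
clauses = [
    ("MC", "mAcuk kattuppAttu vAriya aTikAri TeriviTTAr"),
    ("SC", "maTTiya amaiccarin karuTTil TangkaLaTu TuRaikku utanpAtu illai ena"),
    ("RC", "mAcuk kattuppAttu vAriyam ceyalpaTavillai enRa"),
]


def to_tamil_order(clause_list):
    """Reverse the clause order: MC SC RC -> RC SC MC."""
    return list(reversed(clause_list))


tamil = to_tamil_order(clauses)
print(" ".join(label for label, _ in tamil))
print(" ".join(text for _, text in tamil))
```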

How hard is English-Tamil MT

The previous examples illustrate:
  Tamil -> SOV, English -> SVO
  Tamil is a restricted free word order language
  Tamil is agglutinative

Differences occur at:
  Syntactical level, i.e. word ordering
  Morphological level

We need:
  An efficient syntax reordering module
  A morphological generator

Approaches

Syntax Transfer Based MT
Statistical Machine Translation (SMT)

Syntax Transfer MT

Syntax Transfer MT: Architecture

MT using Synchronous CFG
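
As a rough illustration of the idea, a synchronous CFG pairs each source-side rule with a target-side ordering of the same constituents. The sketch below is a toy under stated assumptions: the rules, the tree encoding, and the names RULES, LEXICON and translate are hypothetical, and the Tamil forms are copied fully inflected from the "Kumar talked about linguistics" example rather than produced by a morphological generator.

```python
# Toy synchronous CFG: each rule pairs an English right-hand side with a
# permutation giving the order of the same children on the Tamil side.
# Rules and lexicon are illustrative assumptions, not the system's grammar.
RULES = {
    ("S", ("NP", "VP")): (0, 1),   # subject stays first
    ("VP", ("V", "PP")): (1, 0),   # verb moves to the end (SOV)
    ("PP", ("P", "NP")): (1, 0),   # preposition becomes a postposition
}

# Word-level pairs; Tamil forms are taken as-is from the slide's example.
LEXICON = {
    "Kumar": "kumAr",
    "talked": "pEcinAn",
    "about": "paRRip",
    "linguistics": "moziyiyalaip",
}


def translate(tree):
    """tree is (label, word) for a leaf, or (label, [subtrees]) otherwise."""
    label, body = tree
    if isinstance(body, str):
        return [LEXICON[body]]
    order = RULES[(label, tuple(child[0] for child in body))]
    out = []
    for i in order:
        out.extend(translate(body[i]))
    return out


english_tree = ("S", [
    ("NP", "Kumar"),
    ("VP", [("V", "talked"),
            ("PP", [("P", "about"), ("NP", "linguistics")])]),
])

print(" ".join(translate(english_tree)))
```

With these rules the output order is subject, postpositional phrase, verb, matching the transliterated Tamil sentence from the word-order example.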

Source Grammar – Tree Adjoining Grammar

Tree Generating System introduced by Aravind Joshi
TAG – Multilevel tree rewriting system
Basic units (Elementary trees)
  Initial trees (Basic structures)
  Auxiliary trees (Recursive structures)

TAG – Formal Definition

TAG Operations

Substitution and Adjunction (example from the XTAG Manual)
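
The two operations can be sketched on a minimal tree type. This is an illustration under stated assumptions, not the XTAG implementation: the Node class, the function names, and the tree shapes are hypothetical simplifications, and the determiner is modelled as an auxiliary tree adjoining at NP only for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # leaf marked as a substitution site
    foot: bool = False   # foot node of an auxiliary tree


def substitute(tree: Node, initial: Node) -> bool:
    """Replace the first matching substitution site (depth-first) with `initial`."""
    for i, child in enumerate(tree.children):
        if child.subst and child.label == initial.label:
            tree.children[i] = initial
            return True
        if substitute(child, initial):
            return True
    return False


def find_foot(node: Node) -> Optional[Node]:
    """Return the foot node of an auxiliary tree, if any."""
    if node.foot:
        return node
    for child in node.children:
        found = find_foot(child)
        if found is not None:
            return found
    return None


def adjoin(target: Node, aux: Node) -> None:
    """Adjoin `aux` at `target`: the material below target moves under the foot.

    Mutates both trees, so an auxiliary tree should be used only once.
    """
    foot = find_foot(aux)
    assert foot is not None and aux.label == target.label == foot.label
    foot.children = target.children
    foot.foot = False
    target.children = aux.children


def yield_of(node: Node) -> List[str]:
    """Read the leaf labels from left to right."""
    if not node.children:
        return [node.label]
    words = []
    for child in node.children:
        words.extend(yield_of(child))
    return words


# Simplified elementary trees (assumed shapes, for illustration only).
bought = Node("S", [Node("NP", subst=True),
                    Node("VP", [Node("V", [Node("bought")]),
                                Node("NP", subst=True)])])
srini = Node("NP", [Node("N", [Node("Srini")])])
book = Node("NP", [Node("N", [Node("book")])])
a_det = Node("NP", [Node("D", [Node("a")]), Node("NP", foot=True)])

substitute(bought, srini)   # fills the subject site
substitute(bought, book)    # fills the object site
adjoin(book, a_det)         # wraps the object NP with the determiner
print(" ".join(yield_of(bought)))
```

Substitution only fills marked leaves, while adjunction splices a tree into an interior node; that second operation is what lets auxiliary trees model recursive structure.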

TAG Derivation Structure

Derivation Tree: “Srini bought a book” (example from the XTAG Manual)

Parsing Lexicalized TAGs

Many parsing algorithms have been proposed, including a CYK parser for TAG, the head-corner parsing algorithm, a bidirectional parsing algorithm, and more recent work on statistical LTAG parsing.
For parsing the source side, Yves Schabes's algorithm was implemented in Java.

Transfer Grammar

Experiments and Results

The entire translation system is written in Java.
Implemented modules include an LTAG parser for English and an STAG system for syntax reordering of English into Tamil.
Our system uses the same language resources developed for the XTAG system for parsing the source-side sentence. All XTAG-related databases have been converted into MySQL format.

LTAG Tree Editor for Visualization

Collaborative effort between Amrita and CDAC

Experiments and Results

Sample Output

Statistical Machine Translation

EILMT English-Tamil Parallel Corpus

Health    Tourism
6000      15000

Monolingual Tamil Data

                 #Sentences    #Words
Training data    95464         >1.2 million
Test data        1000          12K

Statistical Machine Translation

Monolingual data: Sensitivity to Morphology

SMT – Results for Health Corpus

a) Sensitivity to sentence length & Morphology

b) Sensitivity to training corpus size & Morphology

SMT – Results for Tourism Corpus

a) Sensitivity to sentence length & Morphology

b) Sensitivity to training corpus size & Morphology

SMT – Sample Output

Tamil Morphological Analyzer

NLP @ Amrita – Morphological Analyzer for Tamil

Tamil is agglutinative

The major inflectional categories in Tamil are nouns and verbs.

Noun morphology of Tamil is simple compared to verb morphology.

Extremely simple paradigms were used to categorize the root words.

The lexicon includes 50000 nouns and a few hundred verbs.

FSTs were used to build the morphological analyzer/generator.

Figure: Morph Generator FST

NLP @ Amrita – Morphological Analyzer for Tamil

Figure: Finite State Transducer for Morphological Processing
  Morph Generator (FST composition of Lexicon FST and Rule FST): TEnI+N+PL -> TEnIkkaL
  Morph Analyzer (FST composition of Rule Invert FST and Lexicon inv FST): TEnIkkaL -> TEnI+N+PL
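
The generator/analyzer pairing can be sketched without an FST toolkit. The code below is a minimal illustration that uses plain dictionaries and string rewriting in place of compiled transducers, and generate-and-test in place of true FST inversion. The one-entry lexicon, the plural suffix kaL, the gemination rule, and the reading of capital vowels as long vowels are assumptions made to reproduce the single example TEnI+N+PL <-> TEnIkkaL; they are not the analyzer's actual rules.

```python
# Lexicon step: root -> category. Only the slide's example root is listed.
LEXICON = {"TEnI": "N"}

# Assumed tag -> suffix table (illustrative, not the real paradigm set).
SUFFIXES = {"PL": "kaL"}

# Assumed transliteration convention: capital vowels are long vowels.
LONG_VOWELS = "AIUEO"


def generate(analysis):
    """Lexical form 'root+CAT+TAG' -> surface form."""
    root, cat, tag = analysis.split("+")
    if LEXICON.get(root) != cat or tag not in SUFFIXES:
        raise ValueError(f"unknown analysis: {analysis}")
    suffix = SUFFIXES[tag]
    # Rule step (assumption): double a suffix-initial k after a long vowel.
    if root[-1] in LONG_VOWELS and suffix.startswith("k"):
        suffix = "k" + suffix
    return root + suffix


def analyze(surface):
    """Surface form -> all lexical forms that generate it (generate-and-test)."""
    results = []
    for root, cat in LEXICON.items():
        for tag in SUFFIXES:
            candidate = f"{root}+{cat}+{tag}"
            if generate(candidate) == surface:
                results.append(candidate)
    return results


print(generate("TEnI+N+PL"))
print(analyze("TEnIkkaL"))
```

A real FST toolkit obtains the analyzer by inverting the composed generator transducer, so both directions share one set of lexicon and rule definitions; the generate-and-test loop here only imitates that property for a tiny lexicon.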

NLP @ Amrita – Morphological Analyzer for Tamil

Figure: Morphological Analyzer screenshot

Thank you