NATURAL LANGUAGE PROCESSING (NLP): TAMIL - HINDI CONVERSION

Authors: R. MUTHU KUMARAN (II-CSE), R. PANNEER SELVAM (II-ECE)

ARASU ENGINEERING COLLEGE

Upload: muthukumarantdr95

Post on 20-May-2015

INTRODUCTION

Nowadays, information is widely available electronically. Indeed, there has been an explosion of text and multimedia content on the World Wide Web. For many people, a large and growing fraction of work and leisure time is spent navigating and accessing this universe of information. The presence of so much text in electronic form is a huge challenge for NLP.

The Universal Networking Language (UNL) is an electronic language in the form of a semantic network that acts as an intermediate representation to express and exchange every kind of information. UNL represents information, i.e. meaning, sentence by sentence. The information in each sentence is represented as a hypergraph with Universal Words (UWs) as nodes and relations as arcs.
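As a rough illustration of a sentence stored as Universal Words joined by relations, the sketch below keeps one sentence as a list of labelled arcs. The example sentence, the simplified UW spellings and the class and function names are assumptions made for this example; only the relation labels agt (agent) and obj (object) come from the UNL relation set. Scopes, which let a group of arcs act as a single node, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One labelled arc of the semantic network: label(source, target)."""
    label: str
    source: str
    target: str


# Toy encoding of "The boy reads a book"; UWs are simplified strings.
sentence = [
    Relation("agt", "read", "boy"),   # the boy is the agent of reading
    Relation("obj", "read", "book"),  # the book is the object of reading
]


def universal_words(relations):
    """Return the nodes (UWs) of the graph in first-seen order."""
    nodes = []
    for rel in relations:
        for uw in (rel.source, rel.target):
            if uw not in nodes:
                nodes.append(uw)
    return nodes


def to_text(relations):
    """Render the arcs in label(source, target) notation."""
    return [f"{r.label}({r.source}, {r.target})" for r in relations]
```

Keeping the arcs as plain records means the same structure can be indexed, searched or handed to a generator for any target language.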

UNL FEATURES

Text, once converted into UNL, can be converted into many different languages. For example, once a home page is expressed in UNL, it can be read in a variety of natural languages.

The meaning representation is directly available to retrieval and indexing mechanisms and to tools for automatic summarization and knowledge extraction; it is converted to a natural language only when communicating with a human user.

UNL greatly reduces the cost of developing the knowledge or content needed for knowledge processing, because knowledge and content can be shared. Furthermore, if the knowledge required for some task is described in UNL, the software only needs to interpret unambiguous intermediate instructions written in that language to perform its functions.

[Figure: UNL at the centre, linked to Tamil, Hindi, French and Russian. Enconversion maps a natural language into UNL; deconversion maps UNL into a natural language.]
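The benefit of routing every language through UNL can be checked with simple counting. The sketch below, with function names invented for this illustration, compares the number of converters needed with a pivot (one enconverter and one deconverter per language) against direct translation between every ordered pair of languages.

```python
def converters_with_pivot(n_languages: int) -> int:
    """One enconverter and one deconverter per language."""
    return 2 * n_languages


def converters_direct(n_languages: int) -> int:
    """One translator per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


languages = ["Tamil", "Hindi", "French", "Russian"]
n = len(languages)
print(converters_with_pivot(n))  # 8
print(converters_direct(n))      # 12
```

The gap widens as languages are added, since the pivot count grows linearly and the direct count grows quadratically.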

[Figure: Enconverter, showing the analysis rules, the dictionary, the windows (W) over the node list (n(i-1), n(i), n(i+1), n(i+2), n(i+3)), and the node-net. Other labels in the figure: V, TM, N, GM.]

Several analyses are currently available for language conversion:

Aspects Model (Standard Theory)

Extended Standard Theory (EST)

ASPECTS MODEL STANDARD THEORY

In Aspects of the Theory of Syntax, nouns are chosen on the basis of context-free rules; verbs are then chosen on the basis of context-sensitive rules stated in terms of lexical features. Since nouns are the first words to be chosen, they are identified by lexical features only. Verbs and adjectives require additional features to indicate the environments in which they can appear. The grammar in Aspects was organized into three major components.
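To make the difference between the two kinds of rule concrete, here is a minimal sketch with a made-up lexicon. Nouns are picked on their inherent features alone, while each verb carries a frame stating the environment it may appear in. The feature names and lexicon entries are assumptions for illustration, not the notation used in Aspects.

```python
# Nouns: inherent features only (context-free choice).
NOUNS = {
    "boy": {"animate": True},
    "book": {"animate": False},
}

# Verbs: a frame describing the environment they can appear in
# (context-sensitive choice).
VERBS = {
    "sleep": {"takes_object": False, "animate_subject": True},
    "read": {"takes_object": True, "animate_subject": True},
}


def verb_fits(verb, subject, obj=None):
    """Check the verb's frame against the nouns already chosen."""
    frame = VERBS[verb]
    if frame["animate_subject"] and not NOUNS[subject]["animate"]:
        return False
    return frame["takes_object"] == (obj is not None)


def verbs_for(subject, obj=None):
    """List the verbs that may be inserted in this environment."""
    return [verb for verb in VERBS if verb_fits(verb, subject, obj)]
```

For example, `verbs_for("boy", "book")` admits only `read`, because `sleep` does not take an object.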

EXTENDED STANDARD THEORY

Ray Jackendoff offered a substantial criticism of the Standard Theory and showed that surface structure plays a much more important role in semantic interpretation than deep structure. Here a partial representation of meaning is determined by grammatical structure. The logical form is derived step by step, through a derivational process analogous to those of syntax and phonology.

ADVANTAGES

Developing Machine Translation (MT) systems between Tamil and other languages, particularly English and Hindi

Building lexical resources in Tamil that are essential for researchers and developers

Developing basic tools for computational work in Tamil, such as a morphological analyzer and a Part-Of-Speech (POS) tagger

Applying NLP tools to Information Extraction from domain-specific texts, so as to build Information Extraction systems for domains such as medicine and agriculture

TAMIL-HINDI SYSTEM

Tamil-Hindi was chosen for machine-aided translation (MAT) because both are free word-order languages, unlike English, which is a positional language. Ultimately, our aim is to build a Human-Aided Machine Translation system for Hindi-Tamil. An MT system basically has three major components.

[Figure: Tamil to Hindi translation. Tamil word -> MA -> Mapping Unit -> Generator -> Hindi word.]

[Figure: Morphological Analyser (MA), splitting. A Tamil sentence is split into words, and each word is split into morphons.]

[Figure: Morphons. The morphons of a word are the root word, help word, tense marker, GNP (gender, number, person) marker and vibhakti (case marker).]
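The splitting step can be sketched as simple suffix stripping. The example below is a minimal sketch, not the authors' analyser: it uses romanized Tamil, a two-word toy lexicon (meen 'fish', paal 'milk') and three case suffixes, all chosen for illustration, and it ignores sandhi changes, help words, tense markers and GNP markers.

```python
# Toy lexicon of romanized Tamil noun roots (assumed for illustration).
ROOTS = {"meen", "paal"}

# Case suffixes (vibhakti) paired with a label.
CASE_SUFFIXES = [
    ("ukku", "dative"),
    ("ai", "accusative"),
    ("il", "locative"),
]


def analyse(word):
    """Split a word into (root, case); case is None for a bare root."""
    if word in ROOTS:
        return word, None
    for suffix, case in CASE_SUFFIXES:
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in ROOTS:
            return root, case
    raise ValueError(f"cannot analyse {word!r}")


def analyse_sentence(sentence):
    """Split a sentence into words, then each word into morphons."""
    return [analyse(word) for word in sentence.split()]
```

A real analyser would need a full lexicon and rules for the sound changes at morpheme boundaries; the sketch only shows the sentence, word, morphon order of the splitting.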

Example : “ ”

[Figure: Mapping Unit. Morphons are converted with the help of the dictionary and passed to the generator.]
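A mapping unit of this kind can be sketched as two table lookups, one for the root and one for the case marker. The dictionary entries below (romanized Tamil meen, paal to romanized Hindi machhli, doodh), the case labels and the function name are assumptions for illustration only.

```python
# Toy bilingual dictionary: romanized Tamil root -> romanized Hindi root.
ROOT_DICTIONARY = {"meen": "machhli", "paal": "doodh"}

# Tamil case label -> Hindi postposition (None means no postposition).
CASE_TO_POSTPOSITION = {
    None: None,
    "accusative": "ko",
    "dative": "ko",
    "locative": "mein",
}


def map_morphons(root, case):
    """Map one analysed Tamil word to its Hindi root and postposition."""
    if root not in ROOT_DICTIONARY:
        raise KeyError(f"no dictionary entry for {root!r}")
    return ROOT_DICTIONARY[root], CASE_TO_POSTPOSITION[case]
```

Keeping roots and case markers in separate tables means each new dictionary entry works with every case without extra rules.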

[Figure: Generators. Root words and help words are combined into words, and the words are combined into a sentence.]
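The generation step runs in the opposite direction to analysis: pieces are joined into words and words into a sentence. The sketch below assumes the input is a list of (root word, help word) pairs in romanized Hindi; the function names and sample data are invented for illustration, and word reordering and agreement are not handled.

```python
def generate_word(root, help_word=None):
    """Join a root word with its help word, if it has one."""
    return root if help_word is None else f"{root} {help_word}"


def generate_sentence(units):
    """Build the output sentence from (root, help_word) pairs."""
    return " ".join(generate_word(root, help_word) for root, help_word in units)


print(generate_sentence([("machhli", None), ("doodh", "mein")]))
# prints: machhli doodh mein
```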

CONCLUSION

This paper describes the development of a Tamil-Hindi translation system. For Tamil, most of the information needed to generate a sentence from a UNL structure is handled at the morphological and syntactic levels.

This modest system could help alleviate some of the most pressing issues in NLP. The applications of NLP are as vast as an ocean, and we have seen only a drop of it. In the future, NLP will help people communicate comfortably with computers.

Morphological Analysis

Semantic Analysis

Natural Access to Internet & Other Resources

Headline Generation

Headline Translation

Document Translation

Multilingual Multi-document Summarization

Cross-lingual Information Management

Multilingual and Cross-lingual IR

Open Domain Question Answering

Component: Morphological Analyzer for Tamil; Morphological Analyzer for Hindi (would like to collaborate with consortium)

Performance of these techniques in other languages: Kimmo Analyser (English), 95%

Tamil Morphological Analyser, present performance: 92%

1st Year: 96%; 2nd Year: 98-99%

Language pair: Tamil-Hindi

Component: POS Tagger

Performance of these techniques in other languages: Brill's Tagger (English), 99%

Tamil, present performance: 90+%; 1st Year: 96%; 2nd Year: 98-99%

Language pair: Tamil-Hindi

Evaluation metrics in addition to the domain: Precision and Recall

Component: NP Chunker

Performance of these techniques in other languages: FnTBL, 98%

Tamil, present performance: 94+%; 1st Year: 96%; 2nd Year: 98-99%

Language pair: Tamil-Hindi

Domain for which the performance will be optimized: Crime / Tourism

Other evaluation metrics in addition to the domain: Precision and Recall

Component: Transfer Grammar Component

Performance of these techniques in other languages: NA

Tamil, present performance: 50%; 1st Year: 90%; 2nd Year: 95% and above

Language pair: Tamil-Hindi

Other evaluation metrics in addition to the domain: Precision and Recall

Component: Word Generator and Local Language Splitter for Target Language

Present performance: 50%; 1st Year: 90%; 2nd Year: 95% and above

Language pair: Tamil-Hindi

Other evaluation metrics in addition to the domain: Precision, Recall and F-measure
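For reference, precision is the share of the system's outputs that are correct, recall is the share of the reference items that the system found, and the F-measure (here F1) is their harmonic mean. The sketch below computes all three over sets of items; the function name and the sample data are hypothetical and do not come from the evaluation reported here.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over two sets of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical system output versus a hand-made reference.
predicted = {"w1", "w2", "w3"}
gold = {"w1", "w2", "w4", "w5"}
p, r, f = precision_recall_f1(predicted, gold)
```

Here two of the three predicted items are correct and two of the four reference items are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.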

Lexical resource: Hindi-Tamil Bilingual Dictionary

Final size of the lexical resource: 30,000 root words

Average size of such a resource in other languages: 20,000 root words

1st Year: 15,000 root words; 2nd Year: 15,000 root words

Language pair: Tamil-Hindi