Summer Research Project (Anusaaraka) Report

Abstract: Anusaaraka is an English–Hindi language accessing software. With insights from Panini's Ashtadhyayi (Grammar rules), Anusaaraka is a machine translation tool being developed by the Chinmaya International Foundation (CIF), International Institute of Information Technology, Hyderabad (IIIT-H) and University of Hyderabad (Department of Sanskrit Studies). Fusion of traditional Indian shastras and advanced modern technologies is what Anusaaraka is all about. Anusaaraka allows users to access text in any Indian language, after translation from the source language (i.e. English or any other regional Indian language). In today's Information Age large volumes of information are available in English, whether it be information for competitive exams or even general reading. However, a lot of the educated masses whose primary language is Hindi or a regional Indian language are unable to access information in English. Anusaaraka aims to bridge this language barrier by allowing a user to enter an English text into Anusaaraka and get the translation of the same in an Indian language. The Anusaaraka being referred to here has English as the source language and Hindi as the target language. Anusaaraka derives its name from the Sanskrit word ‘Anusaran’ which means ‘to follow’. It is so called, as the translated Anusaaraka output appears in layers, i.e. a sequence of steps that follow each other till the final translation is displayed to the user.

Upload: anwar-jameel

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Summer Research Project (Anusaaraka) Report


Abstract

Anusaaraka is an English–Hindi language accessing software. With insights from Panini's Ashtadhyayi (Grammar rules), Anusaaraka is a machine translation tool being developed by the Chinmaya International Foundation (CIF), International Institute of Information Technology, Hyderabad (IIIT-H) and University of Hyderabad (Department of Sanskrit Studies). Fusion of traditional Indian shastras and advanced modern technologies is what Anusaaraka is all about.

Anusaaraka allows users to access text in any Indian language, after translation from the source language (i.e. English or any other regional Indian language). In today's Information Age large volumes of information are available in English, whether it be information for competitive exams or even general reading. However, a lot of the educated masses whose primary language is Hindi or a regional Indian language are unable to access information in English. Anusaaraka aims to bridge this language barrier by allowing a user to enter an English text into Anusaaraka and get the translation of the same in an Indian language. The Anusaaraka being referred to here has English as the source language and Hindi as the target language.

Anusaaraka derives its name from the Sanskrit word ‘Anusaran’ which means ‘to follow’. It is so called, as the translated Anusaaraka output appears in layers, i.e. a sequence of steps that follow each other till the final translation is displayed to the user.


International Institute of Information Technology (IIIT), Hyderabad

The International Institute of Information Technology, Hyderabad (IIIT-H) is an autonomous university founded in 1998. It was set up as a not-for-profit public private partnership (NPPP) and is the first IIIT to be set up (under this model) in India. The Government of Andhra Pradesh lent support to the institute by grant of land and buildings. A Governing Council consisting of eminent people from academia, industry and government presides over the governance of the institution.

IIIT-H was set up as a research university focused on the core areas of Information Technology, such as Computer Science, Electronics and Communications, and their applications in other domains. The institute evolved strong research programs in a host of areas, with computation or IT providing the connecting thread, and with an emphasis on the development of technology and applications, which can be transferred for use to industry and society. This required carrying out basic research that can be used to solve real life problems. As a result, a synergistic relationship has come to exist at the Institute between basic and applied research. Faculty carries out a number of academic industrial projects, and a few companies have been incubated based on the research done at the Institute.

IIIT-H is organized as research centers and labs, instead of the conventional departments, to facilitate inter-disciplinary research and a seamless flow of knowledge within the Institute. Faculty assigned to the centers and labs conduct research, as well as academic programs, which are owned by the Institute, and not by individual research centers.


Machine Translation

Machine Translation is an important technology for localization, and is particularly relevant in a linguistically diverse country like India. Human translation in India is a rich and ancient tradition. Works of philosophy, arts, mythology, religion, science and folklore have been translated among the ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. In the current era, human translation finds application mainly in administration, media and education, and to a lesser extent in business, arts, and science and technology. India is a linguistically rich area: it has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of the Union. English is very widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages. Only about 5% of the population speaks English. In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is essentially manual. Use of automation is largely restricted to word processing. Two specific examples of high volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language.

As is clear from above, the market is largest for translation from English into Indian languages, primarily Hindi. Hence, it is no surprise that a majority of the Indian Machine Translation (MT) systems are for English-Hindi translation. Natural language processing presents many challenges, of which the biggest is the inherent ambiguity of natural language. MT systems have to deal with ambiguity, and various other NL phenomena. In addition, the linguistic diversity between the source and target language makes MT a bigger challenge. This is particularly true of widely divergent languages such as English and Indian languages. The major structural difference between English and Indian languages can be summarized as follows. English is a highly positional language with rudimentary morphology, and a default sentence structure of subject-verb-object (SVO). Indian languages are highly inflectional, with a rich morphology, relatively free word order, and a default sentence structure of subject-object-verb (SOV). In addition, there are many stylistic differences. For example, it is common to see very long sentences in English, using abstract concepts as the subjects of sentences, and stringing several clauses together (as in this sentence!). Such constructions are not natural in Indian languages, and present major difficulties in producing good translations.

As is recognized the world over, with the current state of art in MT, it is not possible to have Fully Automatic, High Quality, and General-Purpose Machine Translation. Practical systems need to handle ambiguity and the other complexities of natural language processing, by relaxing one or more of the above dimensions. Thus, we can have automatic high-quality ‘sub-language’ systems for specific domains, or automatic general-purpose systems giving rough translation, or interactive general-purpose systems with pre or post editing.

Why Machine Translation?

Today technology has made it possible for individuals worldwide to access large volumes of information at the click of a button. However, very often the information sought may not be in a language that the individual is familiar with. Thus, Machine Translation is an endeavor to minimize the language barrier, by making it possible to access a text in the language of one's choice. For technology to be able to provide the above facility, many aspects of language are involved. To name a few:

• Script

• Spelling

• Vocabulary

• Morphology

• Syntax

Keeping the above in mind, machine translation systems need to be equipped to translate a text within seconds and yet capture the information of the text to the best possible extent.


Anusaaraka

The focus in Anusaaraka is not mainly on machine translation, but on Language Access between Indian languages. Using principles of Paninian Grammar (PG), and exploiting the close similarity of Indian languages, Anusaaraka essentially maps local word groups between the source and target languages. Where there are differences between the languages, the system introduces extra notation to preserve the information of the source language. Thus, the user needs some training to understand the output of the system. The project has developed Language Accessors from many Indian languages into Hindi.

Anusaaraka maps constructions in the source language to the corresponding constructions in the target language wherever possible. For example, a noun or pronoun in the source language is mapped to an appropriate noun or pronoun, respectively, in the target language as shown below:

@H: Apa pustaka paDha_raHA_[HE|thA]_kyA{23_ba.}?
!E: You book read_ing_[is|was] Q.?
E: Are/were you reading a book?

(Where the prefixes mean the following: @H=anusaaraka Hindi, !E=English gloss, E=English.)

In the example above, the last word in the sentence is a verb and illustrates the mapping morpheme by morpheme: the root is mapped to 'paDha' (read), and similarly the tense-aspect-modality (TAM) label is mapped to 'raHA_[HE|thA]' (is_*ing or was_*ing), which is followed by the 'A' suffix, which gets mapped to 'kyA' (what) as a question mark in Hindi. Gender, number, and person (GNP) information is also shown separately in curly brackets ('{23_ba.}' for second or third person and plural).
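The morpheme-by-morpheme composition described above can be sketched in C. This is only an illustration under stated assumptions, not Anusaaraka's actual code: the `WordGroup` structure, its field names and the `format_word_group` function are hypothetical, and the strings used with it are copied from the example in the text.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one mapped verb group. The field names are
   assumptions made for this sketch; they are not taken from Anusaaraka. */
struct WordGroup {
    const char *root;   /* mapped root, e.g. "paDha" (read) */
    const char *tam;    /* mapped tense-aspect-modality label */
    const char *suffix; /* mapped suffix such as "kyA", or NULL if absent */
    const char *gnp;    /* gender-number-person notation, or NULL if absent */
};

/* Joins the mapped morphemes with '_' and appends the GNP information in
   curly brackets. Returns what snprintf returns: the length the full string
   needs, which may exceed size - 1 if the buffer is too small. */
int format_word_group(const struct WordGroup *g, char *out, size_t size)
{
    assert(g != NULL && g->root != NULL && g->tam != NULL && out != NULL);
    return snprintf(out, size, "%s_%s%s%s%s%s%s",
                    g->root, g->tam,
                    g->suffix ? "_" : "", g->suffix ? g->suffix : "",
                    g->gnp ? "{" : "", g->gnp ? g->gnp : "",
                    g->gnp ? "}" : "");
}
```

With the values from the example (`paDha`, `raHA_[HE|thA]`, `kyA`, `23_ba.`) the function writes `paDha_raHA_[HE|thA]_kyA{23_ba.}`, which is the last word group of the @H line.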


Sometimes, for a construction in the source language, the same construction is not available in the target language. In such a case, the system chooses another construction in the target language in which the same information can be expressed. In the example below, the system chooses the complementizer construction in Hindi (EsA) to express the same sense:

@H: hamArA_ladakI_ko `nOkarI karanA_EsA nahIM_[hE|WA].
!E: Our daughter(dat.) job do_should_that not(fem.)
E: It is not the case that our daughter should get a job.

However, Anusaaraka shows the image and therefore, it uses the complementizer (EsA). Sometimes there are slight differences between a construction in the source language and a similar construction in the target language, because of which information might not be preserved. In such a situation additional notation is introduced to express the information which would otherwise get lost. A simple example of this is the lack of distinction between personal pronoun and pronominal adjective in Hindi: vaha.

@H: vaha` pAThshAlA_ko `gayA.
!E: he school(dat.) went.
E: He went to school.

@H: vaha- pAThshAlA_ko `TrophI AyI.
!E: that school(dat.) trophy came
E: That school received the trophy.

When transferring from one language to the other, this distinction would have disappeared, if care was not taken. In Anusaaraka, the two forms are made different by introducing additional notation:

vaha` (he)

vaha- (that)
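The effect of this extra notation can be shown with a very small C sketch. The `VahaRole` enumeration and the `annotate_vaha` function are hypothetical names introduced for illustration; only the two output forms come from the text.

```c
#include <stddef.h>

/* The two grammatical roles that 'vaha' plays in the examples above. */
enum VahaRole { PERSONAL_PRONOUN, PRONOMINAL_ADJECTIVE };

/* Returns 'vaha' with the distinguishing mark described in the text:
   a backquote for the personal pronoun (he) and a hyphen for the
   pronominal adjective (that). */
const char *annotate_vaha(enum VahaRole role)
{
    switch (role) {
    case PERSONAL_PRONOUN:
        return "vaha`";
    case PRONOMINAL_ADJECTIVE:
        return "vaha-";
    }
    return NULL; /* only reached for a value outside the enumeration */
}
```

Because the mark is part of the output word itself, a reader of the Hindi layer can tell which of the two source meanings was intended without consulting the source text.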


Salient Features of Anusaaraka

Faithful representation of text in source language:

Throughout the various layers of Anusaaraka output there is an effort to ensure that the user should be able to understand the information contained in the English sentence. This is given greater importance than giving perfect sentences in Hindi, for it would be pointless to have a translation that reads well but does not truly capture the information of the source text.

The layered output is unique to Anusaaraka. Thus, source language text information and how the Hindi translation is finally arrived at can be accessed by the user. The important feature of the layered output is that the information transfer is done in a controlled manner at every step, thus making it possible to revert back without any loss of information. Also, any loss of information that cannot be avoided in a translation process is then done in a gradual way. Therefore, even if the translated sentence is not as 'perfect' as human translation, with some effort and orientation on reading Anusaaraka output, an individual can understand what the source text is implying by looking at the layers and context in which that sentence appears.

Reversibility:

The feature of gradual transference of information from one layer to the next gives Anusaaraka an additional advantage of bringing reversibility in the translation process, a feature which cannot be achieved by a conventional machine translation system. A bi-lingual user of Anusaaraka can, at any point, access the source language text in English, because of the transparency in the output. Some amount of orientation on how to read the Anusaaraka output would be required for this.


Transparency:

Display of step-by-step translation layers gives an increased level of confidence to the end-user, as he can trace back to the source and get clarity regarding translated text by analysis of the output layers and some reference to context.


Champollion

Champollion is a Robust Parallel Text Sentence Aligner. Parallel text is a very valuable resource for a number of natural language processing tasks, including machine translation, cross language information retrieval, and word disambiguation. Parallel text provides the maximum utility when it is sentence aligned. The sentence alignment process maps sentences in the source text to their translation. The labour intensive and time consuming nature of manual sentence alignment makes large parallel text corpus development difficult. Thus a number of automatic sentence alignment approaches have been proposed and utilized; some are pure length based approaches, some are lexicon based, and some are a mixture of the two approaches.

While existing approaches perform reasonably well on close language pairs, their performance degrades quickly on remote language pairs such as English and Chinese. Performance degradation is exacerbated by noise in the data.

Champollion was initially developed for aligning Chinese-English parallel text. It was later ported to other language pairs, including Arabic–English and Hindi–English.

Champollion differs from other sentence aligners in two ways. First, it assumes a noisy input, i.e. a large percentage of alignments will not be one to one alignments, and that the number of deletions and insertions will be significant. The assumption is against declaring a match in the absence of lexical evidence. Non-lexical measures, such as sentence length information (which are often unreliable when dealing with noisy data), can and should still be used, but they should only play a supporting role when lexical evidence is present. Second, Champollion differs from other lexicon-based approaches in assigning weights to translated words. Translation lexicons usually help sentence aligners in the following way: first, translated words are identified by using entries from a translation lexicon; second, statistics of translated words are then used to identify sentence correspondences.
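The difference between counting translated words equally and weighting them can be illustrated with a short C sketch. The text does not say how Champollion computes its weights, so the inverse-frequency rule below is an assumption chosen only to show the contrast; `TranslatedPair`, `equal_weight_score` and `weighted_score` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One translated word pair found with the translation lexicon.
   Hypothetical layout: only the frequency of the word is kept. */
struct TranslatedPair {
    int doc_count; /* occurrences of the source word in the whole document */
};

/* Equal weighting: every translated pair counts as 1. */
double equal_weight_score(const struct TranslatedPair *pairs, size_t n)
{
    (void)pairs; /* the individual pairs do not matter here */
    return (double)n;
}

/* Assumed weighting for illustration: a pair counts 1/doc_count, so a word
   that occurs once is stronger evidence than a word that occurs everywhere. */
double weighted_score(const struct TranslatedPair *pairs, size_t n)
{
    double score = 0.0;
    size_t i;
    for (i = 0; i < n; i++) {
        assert(pairs[i].doc_count > 0);
        score += 1.0 / pairs[i].doc_count;
    }
    return score;
}
```

For two translated pairs whose words occur once and four times in the document, the equal-weight score is 2, while the weighted score is 1 + 0.25 = 1.25, so the rare word dominates the evidence for the sentence pair.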

In most existing sentence alignment algorithms, translated words are treated equally, i.e. translated word pairs are assigned equal weight when deciding sentence correspondences. For example, 1-1 alignment constitutes 89% of the UBS English-French corpus and 1-0 and 0-1 alignments constitute merely 1.3%. However, when creating very large parallel corpora, the data can be very noisy. For example, in a UN Chinese English corpus, 6.4% of all alignments are either 1-0 or 0-1 alignment.
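The alignment types quoted above can be read as pairs of sentence counts: a 1-1 alignment links one source sentence to one target sentence, while a 1-0 or 0-1 alignment records a sentence with no counterpart. The following C sketch computes the share of such unmatched alignments; the `Alignment` structure and the `unmatched_percentage` function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* An alignment links src source sentences to tgt target sentences,
   so {1,1} is a 1-1 alignment and {1,0} is a 1-0 alignment. */
struct Alignment {
    int src;
    int tgt;
};

/* Percentage of alignments in which one side is empty (1-0 or 0-1). */
double unmatched_percentage(const struct Alignment *a, size_t n)
{
    size_t i, unmatched = 0;
    assert(a != NULL && n > 0);
    for (i = 0; i < n; i++)
        if (a[i].src == 0 || a[i].tgt == 0)
            unmatched++;
    return 100.0 * (double)unmatched / (double)n;
}
```

A figure such as the 6.4% quoted for the UN corpus is this percentage taken over all alignments in that corpus.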

Some of the omissions and insertions were introduced during the translation of the text. Most of the omissions and insertions, however, are introduced during different stages of processing before sentence alignment is carried out. The pre-processing steps include converting the raw data to plain text format, removing tables, footnotes, endnotes, etc. Most of these steps introduce noise. For instance, while a table in an English document can be completely removed, this is not necessarily the case in any given Chinese document. Because of the sheer number of documents involved, manually examining each document after pre-processing is impossible. A robust sentence aligner needs not only to detect most categories of noise, but also to recover quickly if an error is made. It has been proved that existing methods work very well on clean data, but their performance goes down quickly as data becomes noisy.


CODES

Code for extracting regular text from xml file:

#include<stdio.h>
#include<string.h>
#include<stdlib.h>
//NEEDED FOR isdigit()
#include<ctype.h>

//MAXIMUM NUMBER OF PAGES ALLOWED
#define MAX 200

//EXTENSION OF THE FILES BEING CREATED FOR EACH PAGE
#define EXTENSION ".xml"

//LENGTH OF THE EXTENSION OF THE FILE
#define EXTENSION_LENGTH strlen(EXTENSION)

char temp[MAX];

//EXACT NUMBER OF PAGES IN THE SOURCE XML FILE
int totalPages;

//CONTAINS THE CURRENT PAGE NUMBER CONVERTED TO ITS CORRESPONDING FILENAME
char pageNumber[20];

//FILE POINTERS FOR READING THE PAGE FILE AND WRITING TO FINAL TEXT FILES
//TWO TEXT FILES ARE CREATED
//ONE FOR NON SORTED AND THE OTHER FOR SORTED DATA ACCORDING TO CO-ORDINATES OF THE TEXT ON THE PAGE
FILE *fr,*fw;

//STRUCTURE FOR THE CONTENTS OF A SINGLE LINE OF THE XML FILE
struct Line
{
    int top;
    int left;
    int width;
    int height;
    int font;
    char text[10000];
};

//STRUCTURE FOR THE CONTENTS OF A SINGLE PAGE OF XML FILE
struct Page
{
    struct Line line[MAX];
    int lines;
};

//STRUCTURE FOR THE PAGE HEADER
struct Header
{
    int fontId;
    char fontSize[10];
    char color[10];
    struct Header *link;
};

typedef struct Header* HEADER;

struct Page pages[MAX];

HEADER head;

//CONTAINS THE FONTS FOR WHICH THE TEXT IS TO BE EXTRACTED
int fonts[MAX];

//CONTAINS TOTAL NUMBER OF FONTS
int totalFonts;

HEADER getHeader()
{
    return((HEADER)malloc(1*sizeof(struct Header)));
}

void generatePages(char arg[50])
{
    char arr[100];
    //BUILD THE COMMAND LINE; snprintf KEEPS IT WITHIN THE BUFFER
    snprintf(arr,sizeof(arr),"./genPages.out %s",arg);
    printf("Creating Pages\n");
    system("cc genPages.c -o genPages.out");
    system(arr);
    printf("Pages created\n");
}

//CONVERTS A PAGE NUMBER (1 OR GREATER) TO ITS FILE NAME, E.G. 12 BECOMES "12.xml"
void convertToText(int page)
{
    int l,i,j;
    char rev[20];
    i=0;
    //COLLECT THE DIGITS IN REVERSE ORDER
    while(page!=0)
    {
        rev[i++]=(page%10)+'0';
        page=page/10;
    }
    l=i;
    i--;
    for(j=0;j<l;j++)
    {
        pageNumber[j]=rev[i--];
    }
    for(i=0;i<EXTENSION_LENGTH;i++)
        pageNumber[i+l]=EXTENSION[i];
    pageNumber[i+l]='\0';
}

void fetchHeader()
{
    int c,i;
    HEADER t,cur;
    while(1)
    {
        c=getc(fr);
        //STOP AT END OF FILE INSTEAD OF LOOPING FOREVER
        if(c==EOF)
            break;
        if(c=='<')
        {
            c=getc(fr);
            if(c=='f')
            {
                t=getHeader();
                t->link=NULL;
                //READ THE FONT ID
                while(!isdigit(c=getc(fr)));
                i=0;
                while(isdigit(c))
                {
                    temp[i++]=c;
                    c=getc(fr);
                }
                temp[i]='\0';
                t->fontId=atoi(temp);
                //READ THE FONT SIZE
                while(!isdigit(c=getc(fr)));
                i=0;
                while(isdigit(c))
                {
                    t->fontSize[i++]=c;
                    c=getc(fr);
                }
                t->fontSize[i]='\0';
                //READ THE COLOR THAT FOLLOWS THE '#' CHARACTER
                while((c=getc(fr))!='#');
                c=getc(fr);
                i=0;
                while(c!='\"')
                {
                    t->color[i++]=c;
                    c=getc(fr);
                }
                t->color[i]='\0';
                //APPEND THE NEW HEADER TO THE END OF THE LINKED LIST
                if(head==NULL)
                {
                    head=t;
                }
                else
                {
                    cur=head;
                    while(cur->link!=NULL)
                        cur=cur->link;
                    cur->link=t;
                }
                while(getc(fr)!='>');
            }
            else
                break;
        }
    }
}

//READS THE REST OF A TAG AND RETURNS 1 IF IT IS THE CLOSING </text> TAG
int checkLineEnd()
{
    int c,i;
    i=0;
    //THE EOF AND LENGTH CHECKS GUARD AGAINST TRUNCATED INPUT AND OVERLONG TAGS
    while((c=getc(fr))!='>'&&c!=EOF&&i<MAX-1)
        temp[i++]=c;
    temp[i]='\0';
    if(strcmp(temp,"/text")==0)
        return(1);
    return(0);
}

void fetchText(int pgNo)
{
    int c,i;
    i=0;
    while(1)
    {
        c=getc(fr);
        //STOP IF THE FILE ENDS BEFORE THE CLOSING </text> TAG
        if(c==EOF)
            break;
        if(c=='<')
        {
            if(checkLineEnd())
                break;
            else
                continue;
        }
        pages[pgNo].line[pages[pgNo].lines].text[i++]=c;
    }
    pages[pgNo].line[pages[pgNo].lines].text[i]='\0';
}

//READS THE NEXT NUMBER FROM THE PAGE FILE AND RETURNS IT
//RETURNS -1 IF THE END OF THE FILE IS REACHED BEFORE A DIGIT IS FOUND
int readNumber()
{
    int c,i;
    do
    {
        c=getc(fr);
        if(c==EOF)
            return(-1);
    }while(!isdigit(c));
    i=0;
    while(isdigit(c))
    {
        temp[i++]=c;
        c=getc(fr);
    }
    temp[i]='\0';
    return(atoi(temp));
}

void fetchPageInfo(int pgNo)
{
    int c;
    struct Line *ln;
    c=getc(fr);
    while(c!=EOF)
    {
        ln=&pages[pgNo].line[pages[pgNo].lines];
        //THE ATTRIBUTES ARE READ IN THE ORDER top, left, width, height, font
        ln->top=readNumber();
        ln->left=readNumber();
        ln->width=readNumber();
        ln->height=readNumber();
        ln->font=readNumber();
        //A NEGATIVE FONT MEANS THE FILE ENDED BEFORE A COMPLETE LINE WAS READ
        if(ln->font<0)
            break;
        printf("Fetching text for line %d\n",pages[pgNo].lines);
        c=getc(fr);
        fetchText(pgNo);
        pages[pgNo].lines++;
        c=getc(fr);
        c=getc(fr);
    }
}

void fetchFontId(int argc,char *argv[])
{
    int i;
    HEADER cur;
    //ARGUMENTS FROM argv[3] ONWARDS ARE PAIRS OF FONT SIZE AND COLOR
    for(i=3;i<argc-1;i=i+2)
    {
        cur=head;
        while(cur!=NULL)
        {
            if((strcmp(argv[i],cur->fontSize)==0)&&(strcmp(argv[i+1],cur->color)==0))
            {
                fonts[totalFonts++]=cur->fontId;
            }
            cur=cur->link;
        }
    }
}

void createPages()
{
    int i;
    for(i=1;i<=totalPages;i++)
    {
        convertToText(i);
        fr=fopen(pageNumber,"r");
        if(fr==NULL)
        {
            printf("Cannot open the file %s\nExiting\n",pageNumber);
            //A NON-ZERO EXIT STATUS SIGNALS THE FAILURE TO THE CALLER
            exit(1);
        }
        printf("Fetching information of Page %d\n",i);
        fetchHeader();
        pages[i].lines=0;
        fetchPageInfo(i);
        printf("Information of Page %d fetched\n",i);
        fclose(fr);
    }
}

//RETURNS 1 IF THE FONT IS ONE OF THE FONTS SELECTED FOR EXTRACTION
int checkFont(int fnt)
{
    int i;
    for(i=0;i<totalFonts;i++)
    {
        if(fnt==fonts[i])
            return(1);
    }
    return(0);
}

//SORTS THE LINES OF A PAGE FROM TOP TO BOTTOM AND, FOR EQUAL top, FROM LEFT TO RIGHT
void sortPage(int pgNo)
{
    struct Line temp;
    int i,j;
    for(i=0;i<pages[pgNo].lines-1;i++)
    {
        for(j=i+1;j<pages[pgNo].lines;j++)
        {
            if(pages[pgNo].line[i].top>=pages[pgNo].line[j].top)
            {
                if(pages[pgNo].line[i].top==pages[pgNo].line[j].top)
                {
                    if(pages[pgNo].line[i].left>pages[pgNo].line[j].left)
                    {
                        temp=pages[pgNo].line[i];
                        pages[pgNo].line[i]=pages[pgNo].line[j];
                        pages[pgNo].line[j]=temp;
                    }
                }
                else
                {
                    temp=pages[pgNo].line[i];
                    pages[pgNo].line[i]=pages[pgNo].line[j];
                    pages[pgNo].line[j]=temp;
                }
            }
        }
    }
}

void writeText(int pgNo)
{
    int i;
    for(i=0;i<pages[pgNo].lines;i++)
    {
        if(checkFont(pages[pgNo].line[i].font))
        {
            fputs(pages[pgNo].line[i].text,fw);
            putc('\n',fw);
        }
    }
}

void createTextFile(char arg[MAX])
{
    int i;
    //NAME OF THE SORTED OUTPUT FILE, BUILT IN A LOCAL BUFFER
    //APPENDING TO arg DIRECTLY WOULD WRITE PAST THE END OF THE COMMAND LINE ARGUMENT
    char sortedName[MAX+8];
    for(i=1;i<=totalPages;i++)
    {
        writeText(i);
    }
    fclose(fw);
    snprintf(sortedName,sizeof(sortedName),"%s_sorted",arg);
    fw=fopen(sortedName,"w");
    if(fw==NULL)
    {
        printf("Cannot create the file %s\nEXITING\n",sortedName);
        return;
    }
    for(i=1;i<=totalPages;i++)
    {
        sortPage(i);
        writeText(i);
    }
}

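//USAGE: <program> <xml file> <total pages> [<font size> <color>]... <output file>
int main(int argc,char *argv[])
{
    //THE XML FILE, THE PAGE COUNT AND THE OUTPUT FILE ARE ALWAYS REQUIRED
    if(argc<4)
    {
        printf("Usage: %s <xml file> <total pages> [<font size> <color>]... <output file>\n",argv[0]);
        return(1);
    }
    totalPages=atoi(argv[2]);
    //PAGES ARE STORED AT INDEXES 1 TO totalPages OF AN ARRAY OF SIZE MAX
    if(totalPages<1||totalPages>=MAX)
    {
        printf("Total pages must be between 1 and %d\n",MAX-1);
        return(1);
    }
    generatePages(argv[1]);
    head=NULL;
    createPages();
    totalFonts=0;
    fetchFontId(argc,argv);
    fw=fopen(argv[argc-1],"w");
    if(fw==NULL)
    {
        printf("Cannot create the file %s\nEXITING\n",argv[argc-1]);
        return(0);
    }
    createTextFile(argv[argc-1]);
    //createTextFile LEAVES fw AS NULL IF THE SORTED FILE COULD NOT BE CREATED
    if(fw!=NULL)
        fclose(fw);
    return(0);
}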
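The ordering that sortPage establishes (by top, then by left) can also be obtained with the standard library function qsort. The sketch below is an alternative illustration, not part of the original program: `LinePos`, `compare_line_pos` and `sort_line_positions` are hypothetical names, and the structure keeps only the two co-ordinates used for ordering.

```c
#include <stdlib.h>

/* Position of one text line on the page (only the fields used for sorting). */
struct LinePos {
    int top;
    int left;
};

/* Orders lines from top to bottom and, for equal top, from left to right.
   Returns a negative, zero or positive value as qsort requires. */
static int compare_line_pos(const void *a, const void *b)
{
    const struct LinePos *x = a;
    const struct LinePos *y = b;
    if (x->top != y->top)
        return (x->top > y->top) - (x->top < y->top);
    return (x->left > y->left) - (x->left < y->left);
}

/* Sorts n positions in place. */
void sort_line_positions(struct LinePos *p, size_t n)
{
    qsort(p, n, sizeof *p, compare_line_pos);
}
```

The comparison is written with relational operators rather than a subtraction so that it cannot overflow for large co-ordinate values. Like the exchange sort in sortPage, qsort does not promise to keep lines with identical co-ordinates in their original order.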

Code for dividing the xml into pages in accordance with the .pdf file used:

#include<stdio.h>

#include<string.h>

#define MAX 2000

#define START_PATTERN "<page"

#define START_PATTERN_LENGTH strlen(START_PATTERN)

#define END_PATTERN "</page>"

#define END_PATTERN_LENGTH strlen(END_PATTERN)

#define EXTENSION ".xml"

#define EXTENSION_LENGTH strlen(EXTENSION)

FILE *fr,*fw;

char temp[MAX];

char pageNumber[20];

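/* Build the file name "<page>.xml" in the global pageNumber (for example, 17 gives "17.xml"); page must be at least 1. */

void convertToText(int page)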

{

int l,i,j;

char rev[20];

i=0;

while(page!=0)

{

rev[i++]=(page%10)+'0';

page=page/10;

}

l=i;

i--;

for(j=0;j<l;j++)

{

pageNumber[j]=rev[i--];

}

for(i=0;i<EXTENSION_LENGTH;i++)

pageNumber[i+l]=EXTENSION[i];

pageNumber[i+l]='\0';

}

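/* Scan fr for the next "<page" tag and consume it up to its closing '>' plus one following character. Returns 1 when a tag was found, EOF when the input ends first. */

int skip()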

{

int i,c;

for(i=0;i<START_PATTERN_LENGTH;i++)

{

c=getc(fr);

if(c==EOF)

return(EOF);

temp[i]=c;

}

temp[i]='\0';

do

{

if(strcmp(temp,START_PATTERN)==0)

{

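c=getc(fr);

/* Skip the rest of the opening tag; stop at end of file so a truncated tag cannot loop forever. */

while(c!='>' && c!=EOF)

c=getc(fr);

if(c==EOF)

return(EOF);

c=getc(fr);

return(1);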

}

c=getc(fr);

if(c==EOF)

return(EOF);

for(i=0;i<START_PATTERN_LENGTH-1;i++)

temp[i]=temp[i+1];

temp[i++]=c;

temp[i]='\0';

}while(1);

}

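/* Read the next END_PATTERN_LENGTH characters into temp; return 1 if they spell "</page>", 0 otherwise. The caller writes temp out when it is not the end tag. */

int checkPageEnd()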

{

int i;

for(i=0;i<END_PATTERN_LENGTH;i++)

{

temp[i]=getc(fr);

}

temp[i]='\0';

if(strcmp(temp,END_PATTERN)==0)

return(1);

return(0);

}

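int main(int argc,char *argv[])

{

int c,r,i,page;

page=1;

/* argv[1] is read below, so the input file name is required. */

if(argc<2)

{

printf("No input file given\n");

return(1);

}

fr=fopen(argv[1],"r");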

if(fr==NULL)

{

printf("Cannot open %s\n",argv[1]);

return(0);

}

do

{

if(skip()==EOF)

break;

convertToText(page);

fw=fopen(pageNumber,"w");

if(fw==NULL)

{

printf("Cannot create file %s\n",pageNumber);

return(0);

}

else

printf("File for Page Number %s created\n",pageNumber);

do

{

c=getc(fr);

/* End of file inside a page: stop copying instead of looping forever. */

if(c==EOF)

break;

if(c=='<')

{

ungetc(c,fr);

r=checkPageEnd();

if(r)

break;

//putc('<',fw);

for(i=0;temp[i]!='\0';i++)

putc(temp[i],fw);

}

else

{

putc(c,fw);

}

}while(1);

fclose(fw);

page++;

}while(c!=EOF);

fclose(fr);

return(0);

}
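
The manual digit reversal in convertToText can be replaced by a single formatted write. The following is a minimal sketch, assuming a C99 library that provides snprintf; the function name makePageFileName and its buffer parameters are illustrative and not part of the report's code.

```c
#include <stdio.h>

/* Write "<page>.xml" into buf; return 1 on success, 0 if buf is too small. */
int makePageFileName(char *buf, size_t size, int page)
{
    /* snprintf returns the length the full string would have had. */
    int n = snprintf(buf, size, "%d.xml", page);

    return n >= 0 && (size_t)n < size;
}
```

Called with a 20-character buffer and page 17, this writes "17.xml". Unlike the digit loop, it also gives a sensible name for page 0 and reports when the buffer is too small instead of writing past its end.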

Word-Sense Disambiguation (WSD)

In computational linguistics, word-sense disambiguation (WSD) is an open problem of natural language processing. It concerns identifying which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings. A solution to this problem affects other language-processing tasks, such as discourse analysis, improving the relevance of search engines, anaphora resolution and coherence. A disambiguation process requires two inputs: a dictionary that specifies the senses to be disambiguated, and a corpus of language data to be disambiguated (some methods also require a training corpus of language examples). The WSD task has two variants: the "lexical sample" task and the "all words" task. The former comprises disambiguating the occurrences of a small sample of previously selected target words, while in the latter all the words in a piece of running text need to be disambiguated. The latter is deemed a more realistic form of evaluation, but its corpus is more expensive to produce, because human annotators have to read the definitions for each word in the sequence every time they make a tagging judgement, rather than once for a block of instances of the same target word.

To give a hint of how all this works, consider two of the distinct senses that exist for the (written) word "bass":

• a type of fish

• tones of low frequency

and the sentences:

• I went fishing for some sea bass.

• The bass line of the song is too weak.

To a human, it is obvious that the first sentence uses "bass" in the former sense above (the fish) and that the second sentence uses it in the latter sense (low-frequency tones). Developing algorithms to replicate this human ability can often be a difficult task, as is further exemplified by the implicit equivocation between "bass" (the sound) and "bass" (the musical instrument).
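
The idea can be made concrete with a toy program. The following sketch is not Anusaaraka's WSD method: it scores each sense of "bass" by counting hand-picked cue words that occur in the sentence, a much simplified relative of dictionary-overlap approaches such as the Lesk algorithm. The cue lists and the names Sense, overlap and disambiguateBass are assumptions made for this example.

```c
#include <ctype.h>
#include <string.h>

#define CUES 4

/* One sense of the word, with a few hand-picked cue words (illustrative only). */
struct Sense
{
    const char *label;
    const char *cues[CUES];
};

static const struct Sense bassSenses[] = {
    { "fish",  { "fishing", "sea", "caught", "river" } },
    { "sound", { "song", "line", "guitar", "tone" } }
};

/* Count how many cue words of a sense occur as whole words in the sentence. */
static int overlap(const struct Sense *s, const char *sentence)
{
    char word[32];
    size_t len = 0, k;
    int count = 0;
    const char *p;

    for (p = sentence; ; p++)
    {
        if (isalpha((unsigned char)*p))
        {
            /* Collect the current word in lower case (overlong words are cut). */
            if (len < sizeof word - 1)
                word[len++] = (char)tolower((unsigned char)*p);
        }
        else
        {
            if (len > 0)
            {
                word[len] = '\0';
                for (k = 0; k < CUES; k++)
                    if (strcmp(word, s->cues[k]) == 0)
                        count++;
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
    return count;
}

/* Return the label of the best-scoring sense, or "unknown" if no cue matches. */
const char *disambiguateBass(const char *sentence)
{
    const char *best = "unknown";
    int bestScore = 0;
    size_t i;

    for (i = 0; i < sizeof bassSenses / sizeof bassSenses[0]; i++)
    {
        int score = overlap(&bassSenses[i], sentence);
        if (score > bestScore)
        {
            bestScore = score;
            best = bassSenses[i].label;
        }
    }
    return best;
}
```

Applied to the two sentences above, disambiguateBass selects "fish" for the first and "sound" for the second. A sentence that contains none of the cue words yields "unknown", which shows how quickly hand-written cue lists run out and why larger dictionaries and corpora are needed.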

C Language Integrated Production System (CLIPS)

CLIPS is an expert system tool originally developed by the Software Technology Branch (STB), NASA/Lyndon B. Johnson Space Center. Since its first release in 1986, CLIPS has undergone continual refinement and improvement. It is now used by thousands of people around the world. CLIPS is designed to facilitate the development of software to model human knowledge or expertise. There are three ways to represent knowledge in CLIPS:

• Rules, which are primarily intended for heuristic knowledge based on experience.

• Deffunctions and generic functions, which are primarily intended for procedural knowledge.

• Object-oriented programming, also primarily intended for procedural knowledge. The generally accepted features of object-oriented programming are supported: classes, message-handlers, abstraction, encapsulation, inheritance and polymorphism. Rules may pattern-match on objects and facts.

We can develop software using only rules, only objects, or a mixture of objects and rules. Rules and objects form an integrated system, since rules can pattern-match on both facts and objects. CLIPS has also been designed for integration with other languages such as C and Java. In addition to being used as a stand-alone tool, CLIPS can be called from a procedural language, perform its function, and then return control to the calling program. Likewise, procedural code can be defined as external functions and called from CLIPS. When the external code completes execution, control returns to CLIPS. Because disambiguation decisions can be written as rules that match on the context of a word, CLIPS is well suited to word-sense disambiguation.
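
How rules act on facts can be illustrated with a toy forward-chaining loop in plain C. This is only a sketch of the idea: it does not use the CLIPS API or CLIPS syntax, facts are plain strings without variables, and the names FactBase, Rule, assertFact and runRules are invented for this example.

```c
#include <string.h>

#define MAX_FACTS 16

/* A rule: if both condition facts are present, assert the conclusion. */
struct Rule
{
    const char *if1;
    const char *if2;
    const char *then;
};

/* Working memory: a small set of facts stored as strings. */
struct FactBase
{
    const char *facts[MAX_FACTS];
    int count;
};

/* Return 1 if the fact is already in working memory. */
static int hasFact(const struct FactBase *fb, const char *fact)
{
    int i;

    for (i = 0; i < fb->count; i++)
        if (strcmp(fb->facts[i], fact) == 0)
            return 1;
    return 0;
}

/* Add a fact unless it is already present or memory is full; return 1 if added. */
int assertFact(struct FactBase *fb, const char *fact)
{
    if (hasFact(fb, fact) || fb->count >= MAX_FACTS)
        return 0;
    fb->facts[fb->count++] = fact;
    return 1;
}

/* Fire rules repeatedly until no rule adds a new fact; return the number added. */
int runRules(struct FactBase *fb, const struct Rule *rules, int nRules)
{
    int added = 0, changed, i;

    do
    {
        changed = 0;
        for (i = 0; i < nRules; i++)
        {
            if (hasFact(fb, rules[i].if1) && hasFact(fb, rules[i].if2)
                && assertFact(fb, rules[i].then))
            {
                added++;
                changed = 1;
            }
        }
    } while (changed);

    return added;
}
```

With the facts "word bass" and "context song" asserted and a rule whose two conditions are exactly those facts, runRules adds the rule's conclusion (for example "sense sound") to the fact base. A rule whose conditions are not all present simply does not fire.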

Conclusion

Machine translation (MT) is relatively new in India, about a decade old. In comparison with MT efforts in Europe and Japan, which are at least three decades old, it would seem that Indian MT has a long way to go. However, this can also be an advantage, because Indian researchers can learn from the experience of their global counterparts. There are close to a dozen projects now, about six of them at an advanced prototype or technology transfer stage and the rest newly initiated.

The Indian NLP/MT scene has so far been characterized by an acute scarcity of basic lexical resources such as corpora, machine-readable dictionaries (MRDs), lexicons, thesauri and terminology banks. Also, the various MT groups have used different formalisms, each best suited to its specific applications, and hence there has been little sharing of resources among them. These issues are being addressed now. There are governmental as well as voluntary efforts under way to develop common lexical resources and to create forums for consolidating and coordinating NLP and MT efforts. It appears that the exploratory phase of Indian MT is over and the consolidation phase is about to begin, with the focus moving from proof-of-concept prototypes to productionization, deployment, collaborative resource sharing and evaluation.

The core Anusaaraka output is in a language close to the target language, and can be understood by the human reader after some training. The question is how much training is necessary to reach a very high degree of comprehension. Our experience of working among Indian languages shows that this training is likely to be small. The reason is that India forms a linguistic area: Indian languages share vocabulary and grammatical constructions, as well as pragmatics and culture. A similar approach can be applied to build an English to Hindi Anusaaraka. A study can be conducted on the training required to read such output. The expectation is that a usable English to Hindi system can be built, except that it will require longer training.