Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Keynote delivered at WAVES2012, UMASSD, July14,2012 Current Progress in Developing Resources for Sanskrit and other Indian languages Girish Nath Jha Associate Professor, Computational Linguistics Special Center for Sanskrit Studies, J.N.U., New Delhi 110067 & Mukesh and Priti Chatter Distinguished Professor of History of Science, University of Massachusetts Dartmouth Special Centre for Sanskrit Studies, J.N.U., New Delhi


Page 1: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Keynote delivered at WAVES2012, UMASSD,

July14,2012

Current Progress in Developing

Resources for Sanskrit and other Indian

languages

Girish Nath Jha

Associate Professor, Computational Linguistics

Special Center for Sanskrit Studies, J.N.U., New Delhi – 110067

&

Mukesh and Priti Chatter Distinguished Professor of History of Science,

University of Massachusetts Dartmouth

Special Centre for Sanskrit Studies, J.N.U., New Delhi

Page 2: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

What is a “resource” ?

Language data and corpora in standard formats for computer processing, for direct or indirect use by humans

India is considered a “resource-poor” country because it does not have enough standard resources.

Page 3: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

What does it mean for Sanskrit ?

-electronic texts, dictionaries

-digital libraries

-parallel corpora

-search engines

-language processing tools (MT, Speech, OCR, OLHWR etc)

-second Indology revolution in the making?

Page 4: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Why Sanskrit?

Page 5: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Language     Script(s)               Family
Hindi        Devanagari              Indo Aryan
Sanskrit     Devanagari              Indo Aryan
Marathi      Devanagari              Indo Aryan
Konkani      Devanagari              Indo Aryan
Maithili     Devanagari              Indo Aryan
Nepali       Devanagari              Indo Aryan
Sindhi       Devanagari              Indo Aryan
Bodo         Devanagari              Tibeto Burman
Dogri        Devanagari              Indo Aryan
Santhali     Devanagari, Ol Chiki    Austro Asiatic
Bengali      Bengali                 Indo Aryan
Assamese     Bengali                 Indo Aryan
Manipuri     Bengali, Meithei        Tibeto Burman
Gujarati     Gujarati                Indo Aryan
Kannada      Kannada                 Dravidian
Malayalam    Malayalam               Dravidian
Oriya        Oriya                   Indo Aryan
Punjabi      Gurumukhi               Indo Aryan
Tamil        Tamil                   Dravidian
Telugu       Telugu                  Dravidian
Urdu         Perso-Arabic            Indo Aryan
Kashmiri     Perso-Arabic            Indo Aryan

Page 6: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Indian constitution on languages

448 articles, 12 schedules, 107 amendments (so far)

Part III – Fundamental Rights

Part IV A – Fundamental Duties

Part XVII – Official Language

Part XVII – Regional Languages

Part XVII – Language of the Supreme Court and High Courts

Part XVII – Special Directives

Page 7: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Sanskrit Commission, 1956

Page 8: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Sanskrit in digital age

Computer for Sanskrit

Sanskrit for Computer

Page 9: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents

Sanskrit Wikipedia (Sanskrit-medium Wikipedia)

http://sa.wikipedia.org

Sanskrit Wikisource (Sanskrit e-texts)

Sanskrit Wiktionary (Sanskrit dictionary)

Sanskrit Wikibooks (Sanskrit e-library)

Page 10: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents

Digital libraries

DLI project (http://dli.iiit.ac.in/): 1022 Sanskrit books (IISc, CMU, NSF, ERNET, MCIT)

NSF funded, Brown Univ (http://www.sanskritlibrary.org/)

Clay’s project (http://www.claysanskritlibrary.org), JJC Foundation, NYU Press

INRIA, Paris (technical texts, tools)

IGNCA (http://ignca.nic.in/sanskrit.htm)

J-TESS (JNU Text Encoding and Search for Sanskrit)

Page 11: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents

Sanskrit e-documents

Maharshi Mahesh Yogi (http://sanskrit.safire.com/Sanskrit.html)

Avinash Sathaye - Sanskrit documents list (http://sanskritdocuments.org/)

Srinivas Varkhedi – Sanskrit corpus (http://rsvidyapeetha.ac.in/)

Oliver Hellwig (Univ of Berlin)

Anand Mishra (http://sanskrit.sai.uni-heidelberg.de/)

http://sanskrit.jnu.ac.in

Page 12: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents

Sanskrit documents

Tirupati Vidyapeeth

ASR Melkote

CDAC- heritage computing group

Sanskrit blogs

JNU students

Others (http://sanskritlinks.blogspot.com )

Sanskrit corpora and tagset

JNU, LDC, Univ. of Pennsylvania, U.Hyd

Page 13: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents: static

Himanshu Pota (http://learnsanskrit.wordpress.com/)

http://www.ee.adfa.edu.au/staff/hrp/personal/sanskrit/

American Sanskrit Institute (http://www.americansanskrit.com/)

Acharya, IITM (http://acharya.iitm.ac.in/sanskrit/tutor.php)

Vasudev Bhatt (http://www.ourkarnataka.com/learnsanskrit/sanskrit_main.htm)

Sanskrit Bharati (http://www.samskrita-bharati.org/newsite/index.php)

http://sanskritbhasha.blogspot.com/

Page 14: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

major e-contents: dynamic

Tutorials

Sudhir Kaicker (http://www.sanskrit-lamp.org/)

Prof. G.V.Singh (CASTLE project of DoE)

Peter Scharf

Avinash Sathaye

Sanskrit CD (Mahesh Kulkarni, CDAC Pune)

Language processing tools

Gerard Huet

Amba Kulkarni

Peter Scharf

Girish N Jha

Anand Mishra

Page 15: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Editor:

Girish Nath Jha

Page 16: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Work done at Jawaharlal Nehru University (JNU), New Delhi

Page 17: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Special Center for Sanskrit Studies, JNU

Linking traditional scholarship with modern methods

Exploring Science & Technology in Sanskrit

Developing language technology resources and tools for Sanskrit and other Indian languages

Collaboration with universities

Collaboration with industry

Page 18: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

SATIAIT - Science And Technology In Ancient Indian Texts

Page 19: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Center for Indic Studies, UMASSD initiative

Thanks to the initiative and efforts of Prof Bal Ram Singh, we are carrying out the following activities:

Identifying key S&T texts

Digitizing them, providing computer help

Translating

Lab experiments

Documenting…

Page 20: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Editors:

Bal Ram Singh

Girish Nath Jha

Umesh Kumar Singh

Diwakar Mishra

Page 21: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Editors:

Girish Nath Jha

Bal Ram Singh

R P Singh

Diwakar Mishra

Page 22: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Editors:

Angela Marcantonio

Girish Nath Jha

Page 23: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Technology Development for Indian Languages

Page 24: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Building Blocks of Language Technology Development

Localization

Linguistic Resources

Standards

Certification

Software/Tools

Training, Awareness

Technologies

Page 25: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Near Future initiatives

Localization R & D Center (JNU, CDAC, IIT Delhi)

NME-ICT center at JNU (MHRD, JNU)

Page 26: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Machine Translation

SHMT (Dept of IT, Govt. of India)

SaHiT (unfunded)

Microsoft Translator Hub

English-Hindi (Microsoft)

English-Urdu (Microsoft)

English-Gujarati (Microsoft)

Sanskrit-English (unfunded)

English-Maithili (unfunded)

Page 27: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

SHMT (DIT)

A consortium of 7 universities/institutes:

University of Hyderabad

JNU

IIIT Hyderabad

Tirupati Vidyapeeth

Sanskrit Academy Hyderabad

Poornaprajna Vidyapeeth Bangalore

Rajasthan Sanskrit University, Jaipur

Duration: 3 yrs (2008 – 2012)

MT system to be hosted on http://tdil-dc.in very soon

Phase 2 (2012-2015)

Page 28: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Indian Languages Corpora Initiative (ILCI)

Page 29: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

A consortium of Indian universities has been formed under my leadership: 17 languages, remaining 6 to join later

Parallel tagged corpora of 100,000 sentences in all Indian languages in tourism, health, agriculture, entertainment domains

Funded by TDIL program of Ministry of C & IT

Phase 1: 2009-12 (a consortium of 12 languages including English) - corpora to be hosted on http://tdil-dc.in very soon

Phase 2: 2012-2015

Page 30: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Languages & Consortia partners

Consortium of universities

Server-based corpora development and management; the server is called “sanskrit”

Limited crowdsourcing

Page 31: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Shallow parsing tools for Indian languages

Under a consortium project led by Univ. of Hyderabad

Morph analyzers for 11 Indian languages

Duration: 2012-15

Page 32: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Consultancies

Page 33: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Online Handwriting Recognition for Devanagari-based languages

-Microsoft

Page 34: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Indic languages tagset and annotation

-Microsoft Research India

Page 35: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Multimodal data in 8 security-sensitive languages (Indian English, Hindi, Urdu, Tamil, Bangla, Punjabi, Pushto, Dari)

-LDC, University of Pennsylvania

Page 36: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

English - (major) Indian languages Machine Translation (English-Hindi, English-Urdu, English-Gujarati, Sanskrit-English, English-Maithili)

Started this summer

-Microsoft

Page 37: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Some of the recent R&D with the help of research students

Page 38: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Sanskrit Speech Synthesizer, in collaboration with Microsoft Research India (prototype by next year)

Named Entity Recognizer for Sanskrit (prototype finished)

Page 39: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

J-TESS: JNU Text Encoding & Search for Sanskrit

Page 40: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Tools

Server-based corpora creation and annotation application called ILCIANN

Sanskrit and other Indian languages processing tools

Multimedia animation, e-learning tools

Lexical resources and search

Indian language Transliterator
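
The slides do not describe how the transliterator listed above works. As a rough illustration of one common approach (not necessarily the one used at JNU), the sketch below relies on the fact that the Unicode blocks for the major Indic scripts follow a shared ISCII-derived layout, so a letter can often be mapped to another script by shifting its code point. The names BLOCK_START, SHARED_PUNCTUATION and transliterate are hypothetical, chosen only for this example.

```python
# Hypothetical sketch: script-to-script transliteration by Unicode block offset.
# Assumption: the source and target blocks share the ISCII-derived layout,
# which holds for most (not all) letters in these nine scripts.

# First code point of each script's Unicode block.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80

# Danda and double danda live in the Devanagari block but are shared
# by the other scripts, so they are left unchanged.
SHARED_PUNCTUATION = {0x0964, 0x0965}


def transliterate(text, source, target):
    """Shift every character of the source block into the target block."""
    src = BLOCK_START[source]
    dst = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in SHARED_PUNCTUATION:
            out.append(ch)
        elif src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))
        else:
            # Spaces, digits, Latin text and so on pass through untouched.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Devanagari letter KA rendered in a few other scripts.
    for script in ("gujarati", "kannada", "telugu", "bengali"):
        print(script, transliterate("\u0915", "devanagari", script))
```

This simple shift has clear limits: a letter with no counterpart in the target script (for example, aspirated consonants when the target is Tamil) lands on an unassigned code point, so a practical tool would need per-script exception tables on top of it.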

Page 41: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

Demo

http://sanskrit.jnu.ac.in

Page 42: Resources for Sanskrit and other Indian Languages-- Dr Girish Nath Jha

ક क

ಕ കൂ क କ ਕ క

ક ಕ କ ਕ

ক क

ક గ

धन्यवाद (Thank you)! Questions?

[email protected]

91-11-26741308