Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury, Microsoft Research Lab India

A Transliterated World Wide Web

Song Lyrics

Reviews and Forums

Facebook and Twitter

And a lot more

Beyond Indic languages

Many languages that use a non-Roman script:

Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)

Persian

Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)

Thai, Vietnamese

Cyrillic (Russian, Ukrainian)

Chinese, Japanese, Korean (rare)

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

IR Scenario - I

Mono-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni suhanee

Results: Only Roman transliterated documents

Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
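The spelling-variation challenge can be illustrated with a small normalization sketch. The function name `roman_key` and its rewrite rules are hypothetical, hand-picked only to cover the example query on this slide; they are not the method proposed in the talk, and a real system would more likely learn such equivalences from data.

```python
import re

def roman_key(word):
    """Map a Romanized word to a rough spelling-insensitive key.

    The rules below are illustrative assumptions, not a complete or
    linguistically validated scheme.
    """
    w = word.lower()
    w = w.replace("w", "v")                # hawa / hava
    w = re.sub(r"ee", "i", w)              # thandee / thandi
    w = re.sub(r"oo", "u", w)              # soohaany / suhaany
    w = re.sub(r"aa", "a", w)              # suhaany / suhany
    w = w.replace("y", "i")                # chandny / chandni
    w = re.sub(r"(?<=[^aeiou])h", "", w)   # drop h after a consonant: th -> t
    w = re.sub(r"h$", "", w)               # drop a trailing h: yeh / ye
    w = re.sub(r"(.)\1+", r"\1", w)        # collapse any remaining doubled letters
    return w

def query_keys(query):
    """Normalize every whitespace-separated token of a query."""
    return [roman_key(tok) for tok in query.split()]
```

With these rules, `query_keys("thandee hava yeh chandni suhanee")` and `query_keys("tandee hawa ye chandny soohaany")` produce the same list of keys, so both spellings would hit the same index entries. The obvious risk of such aggressive rules is that unrelated words may collapse onto the same key.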

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी

Results: Documents in Roman transliteration as well as in native script

Challenge: Transliteration
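One simple way to picture the transliteration challenge is a lookup table that maps Roman spellings back to native script, so that a Roman query can be issued in both scripts. The sketch below assumes a tiny hypothetical table, `PAIRS`, built from the example query on this slide; the names `to_native` and `cross_script_query` are illustrative only, and a table lookup cannot handle unseen words or spelling variants.

```python
# Hypothetical (native script, Roman spelling) pairs taken from the example
# query on this slide; a real table would come from a much larger resource.
PAIRS = [
    ("ठंडी", "thandee"),
    ("हवा", "hava"),
    ("ये", "yeh"),
    ("चाँदनी", "chandni"),
]

# Backward transliteration table: Roman spelling -> native-script word.
ROMAN_TO_NATIVE = {roman: native for native, roman in PAIRS}

def to_native(query):
    """Replace every known Roman token by its native-script form.

    Unknown tokens are left unchanged.
    """
    return " ".join(ROMAN_TO_NATIVE.get(tok.lower(), tok) for tok in query.split())

def cross_script_query(query):
    """Return the query variants to search: Roman, plus native if it differs."""
    native = to_native(query)
    return [query] if native == query else [query, native]
```

Calling `cross_script_query("thandee hava yeh chandni")` yields the Roman query together with its Devanagari counterpart, which a search engine could combine with OR as in the example above.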

IR Scenario - III

Cross-script and Cross-lingual IR

Query: death of mareech and subahoo

Documents: Hindi (transliterated and Devanagari) and English documents

Shared Task on Retrieval

Mono-script Monolingual IR

Query: Transliterated, in Roman

Documents: Transliterated, in Roman

Cross-script Monolingual IR

Query: Transliterated, in Roman

Documents: In native script

Multi-script Monolingual IR

Query: In Roman or native script

Documents: In Roman and native scripts

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

Example: kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML

Transliteration

Forward: കഴിക്കാന്‍ → kazhikkan

Backward: kazhikkan → കഴിക്കാന്‍
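The word-level language identification sub-task can be sketched with a plain lexicon lookup. Everything here is an assumption made for illustration: the word list `ENGLISH_WORDS` is a toy stand-in for a real English lexicon, and `tag_languages` simply labels every other token with a default tag (ML for Malayalam, as in the example on this slide). A practical tagger would need context, since many Romanized words are also valid English words.

```python
# Toy English lexicon (hypothetical); a real system would use a full word list.
ENGLISH_WORDS = {"split", "pea", "soup", "the", "and", "of"}

def tag_languages(text, other_tag="ML"):
    """Tag each whitespace-separated token as EN or as `other_tag`.

    A token is tagged EN only if it appears in the toy English lexicon;
    anything else is assumed to be transliterated text in the other language.
    """
    return [
        (tok, "EN" if tok.lower() in ENGLISH_WORDS else other_tag)
        for tok in text.split()
    ]

tagged = tag_languages("kooda kazhikkan oru urgan split pea soup undaki")
```

For the example sentence, `tagged` pairs the first four tokens and the last one with ML and the tokens split, pea, and soup with EN, matching the labels shown on the slide.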

Available Data

20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics

More data is under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!

Available Data

Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics

Looking for partners to extend it to other (Indian) Languages

Other domains?

Thank you! monojitc@microsoft.com

Other resources

Lexicons

Pronunciation lexicons

G2P for some languages

Stemmers and morphological analyzers

Anything else?

Concluding Remarks

We have built a Multi-script Bollywood Song Search and are working on transliteration and code-mixing

These are just some initial ideas that came up from our experiences

If you are interested please let me know
