Search in Transliterated Space Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India


Post on 24-Feb-2016


Page 1: Search in Transliterated Space

Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury, Microsoft Research Lab India

Page 2: Search in Transliterated Space

A Transliterated World Wide Web

Song Lyrics

Page 3: Search in Transliterated Space

A Transliterated World Wide Web

Reviews and Forums

Page 4: Search in Transliterated Space

A Transliterated World Wide Web

Facebook and Twitter

Page 5: Search in Transliterated Space

A Transliterated World Wide Web

And a lot more

Page 6: Search in Transliterated Space

Beyond Indic languages

Many languages that use a non-Roman script:
  Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
  Persian
  Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
  Thai, Vietnamese
  Cyrillic (Russian, Ukrainian)
  Chinese, Japanese, Korean (rare)

Page 7: Search in Transliterated Space

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

Page 8: Search in Transliterated Space

IR Scenario - I

Mono-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni suhanee

Results: Only Roman transliterated documents

Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
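One way to picture the spelling-variation challenge is a normalization key that collapses common Roman spelling variants before matching. The sketch below is only an illustration: the rewrite rules (w to v, ee to i, oo to u, collapsing doubled letters, dropping aspiration "h") are ad hoc assumptions chosen to fit the example query, not rules from the proposal, and the function names are hypothetical.

```python
import re


def normalize(word: str) -> str:
    """Collapse a Roman-transliterated word to a rough spelling-insensitive key.

    All rules here are illustrative assumptions, not a validated scheme.
    """
    w = word.lower().replace("w", "v")           # hawa / hava
    w = w.replace("ee", "i").replace("oo", "u")  # long-vowel digraphs
    w = re.sub(r"(.)\1+", r"\1", w)              # remaining doubled letters (aa -> a)
    w = w[:1] + w[1:].replace("y", "i")          # non-initial y acts as a vowel
    w = re.sub(r"(?<=[bcdgjkpt])h", "", w)       # aspiration marker: th -> t, ch -> c
    w = re.sub(r"h$", "", w)                     # word-final h: yeh -> ye
    return w


def normalize_query(query: str) -> str:
    """Normalize every whitespace-separated token of a query."""
    return " ".join(normalize(tok) for tok in query.split())


if __name__ == "__main__":
    print(normalize_query("thandee hava yeh chandni suhanee"))
    print(normalize_query("tandee hawa ye chandny soohaany"))
```

Both spellings of the example query reduce to the same key under these rules, so an index built on normalized tokens would match either one. Hand-written rules like these over-merge unrelated words quickly; a real system would more likely learn the equivalences from data such as aligned word pairs.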

Page 9: Search in Transliterated Space

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी

Results: Documents that are either Roman transliterated or in native script

Challenge: Transliteration
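A minimal sketch of the cross-script idea, assuming a word-pair table is available: map every Roman token to its native-script form when known, then match queries and documents in that common space. The four pairs below are read off the example query on this slide; the document collection, function names, and overlap-count scoring are hypothetical and only for illustration.

```python
# Toy Roman -> Devanagari table, aligned word by word from the example query.
# A real system would mine or learn such pairs; this table is an assumption.
ROMAN_TO_NATIVE = {
    "thandee": "ठंडी",
    "hava": "हवा",
    "yeh": "ये",
    "chandni": "चाँदनी",
}


def to_native(token: str) -> str:
    """Map a Roman token to native script if known; otherwise keep it as is."""
    return ROMAN_TO_NATIVE.get(token.lower(), token)


def canonical(text: str) -> set:
    """Represent a text as the set of its tokens in the common (native) space."""
    return {to_native(tok) for tok in text.split()}


def search(query: str, docs: dict) -> list:
    """Rank documents by token overlap with the query; drop zero-overlap docs."""
    q = canonical(query)
    scored = [(len(q & canonical(body)), doc_id) for doc_id, body in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


# Hypothetical mixed-script collection.
DOCS = {
    "roman_lyrics": "thandee hava yeh chandni suhanee",
    "native_lyrics": "ठंडी हवा ये चाँदनी",
    "unrelated": "split pea soup",
}

if __name__ == "__main__":
    print(search("thandee hava yeh chandni", DOCS))
    print(search(DOCS["native_lyrics"], DOCS))
```

With this table, the Roman query and the native-script query retrieve the same two documents. An exact-lookup table does not handle spelling variants or unseen words, which is where a transliteration model would be needed.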

Page 10: Search in Transliterated Space

IR Scenario - III

Cross-script and Cross-lingual IR

Query: death of mareech and subahoo

Documents: Hindi (Transliterated and Devanagari) and English documents

Page 11: Search in Transliterated Space

Shared Task on Retrieval

Mono-script Monolingual IR
  Query: transliterated, in Roman
  Documents: transliterated, in Roman

Cross-script Monolingual IR
  Query: transliterated, in Roman
  Documents: in native script

Multi-script Monolingual IR
  Query: in Roman or native script
  Documents: in Roman and native scripts

Page 12: Search in Transliterated Space

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

  kooda  kazhikkan  oru  urgan  split  pea  soup  undaki
  ML     ML         ML   ML     EN     EN   EN    ML

Transliteration

  Forward: കഴിക്കാന്‍ → kazhikkan
  Backward: kazhikkan → കഴിക്കാന്‍
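The word-level labeling in the language identification example can be sketched as a lexicon lookup. This is only a toy baseline: the three-word English lexicon and the rule "anything not in the lexicon is ML" are assumptions made to reproduce the example on this slide, and the function names are hypothetical.

```python
from itertools import groupby

# Hypothetical toy English lexicon; a real one would be far larger.
EN_LEXICON = {"split", "pea", "soup"}


def tag_tokens(sentence, en_lexicon=EN_LEXICON, default="ML"):
    """Label each token EN if it is in the lexicon, else the default language."""
    return [
        (tok, "EN" if tok.lower() in en_lexicon else default)
        for tok in sentence.split()
    ]


def language_runs(tagged):
    """Group consecutive tokens with the same label into code-mixed segments."""
    return [
        (lang, [tok for tok, _ in group])
        for lang, group in groupby(tagged, key=lambda pair: pair[1])
    ]


if __name__ == "__main__":
    tagged = tag_tokens("kooda kazhikkan oru urgan split pea soup undaki")
    for tok, lang in tagged:
        print(f"{tok}\t{lang}")
    print(language_runs(tagged))
```

A lookup like this fails on words that are valid in both languages or spelled unusually, which is part of what makes the sub-task non-trivial.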

Page 13: Search in Transliterated Space

Available Data

20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics

More data is under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!

Page 14: Search in Transliterated Space

Available Data

Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics

Looking for partners to extend it to other (Indian) Languages

Other domains?

Page 15: Search in Transliterated Space

Thank you!

Page 16: Search in Transliterated Space

Other resources

Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers

Anything else?

Page 17: Search in Transliterated Space

Concluding Remarks

We have built a multi-script Bollywood song search and are working on transliteration and code-mixing

These are just some initial ideas that came out of our experience

If you are interested, please let me know