
  • Multilingual Digital Libraries Multilingual and Crosslingual Retrieval and Access

    Sudeshna Sarkar

    IIT Kharagpur

    UNESCO-NDL INDIA INTERNATIONAL WORKSHOP ON KNOWLEDGE ENGINEERING FOR DIGITAL LIBRARY DESIGN

  • Many Languages

    World

    • 6909 living languages

    India

    • India has 122 major languages (22 are constitutionally recognized) and 2371 dialects (Census 2001)

    • Multiple scripts

    • Population of literates

    – 20% of India understand English

    – 80% cannot

    • Rich collection of books and materials in different Indian languages

    • The Divide

    • The availability of resources, training data, and benchmarks in English leads to a disproportionate focus on a few languages and the neglect of many widely spoken languages

  • Multilingualism and Universal Access

    • Promotion and use of multilingualism

    • Promotion of linguistic and cultural diversity

    • Universal access to cyberspace

    • Not all languages are equally represented. The universal digital library should ideally bridge this gap, letting users find objects in languages other than their native one.

  • Diversity of languages in India

    Percentage of speakers (Census 2001): Hindi 41.03, Bengali 8.11, Telugu 7.19, Marathi 6.99, Tamil 5.91, Urdu 5.01, Gujarati 4.48, Kannada 3.69, Malayalam 3.21, Oriya 3.21, Punjabi 2.83, Assamese 1.28, Maithili 1.18, Santali 0.63, Kashmiri 0.54, Nepali 0.28, Sindhi 0.25, Konkani 0.24, Dogri 0.22, Manipuri 0.14, Bodo 0.13, Sanskrit negligible; non-scheduled languages 3.44

    Motto of NDL: Inclusivity. Reach every Indian

  • Roles of Language in Digital Libraries

    • Language associated with objects

    – Metadata

    – Content

    • Language of interactions

    – Such as query

    • Interface language

    • A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired and whose search and retrieve functions are language independent.

    • Features

    – The user can choose the interface language

    – The user can choose the interaction language

    – The user can access metadata and content in any desired language

  • Multilingual Search and Access

    • Cross-Language Information Retrieval (CLIR)

    – Querying multilingual collections in one language in order to retrieve documents in another language

    • Multilingual Information Retrieval (MLIR)

    – Process information (queries, documents, both) in multiple languages

    • Multilingual Information Access (MLIA)

    – Query, retrieval and presentation of information in any language

    [Diagram: CLIR takes a query in a source language (SL) against documents in a target language (TL) and returns results in the TL; MLIR takes a query in any language Li against document collections in many languages (L1, L2, ..., Ln) and returns results from all of them.]

  • The ultimate multilingual search system

    “Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.”

    Douglas W. Oard and David Hull, AAAI Symposium on Cross-Language IR, Spring 1997, Stanford, USA

  • Enabling Multilinguality of Data

    Data = content plus metadata

    • Language Identification

    – Associate language tag with data.

    – Automatic language identification (a minimal sketch follows this slide).

    • Multilingual Indexing and Presentation

    – Controlled vocabularies for metadata fields such as subject, with different values for different languages

    – Mapping of words across languages

    – Automatic translation of metadata fields such as title and abstract

    – Automatic translation of content
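    A minimal sketch of automatic language identification via character n-gram profiles, in the spirit of Cavnar and Trenkle; the reference samples (hindi_sample, bengali_sample) are hypothetical corpora, not part of the original slides.

      # Minimal character n-gram language identifier (illustrative sketch).
      from collections import Counter

      def ngram_profile(text, n=3, top=300):
          # ranked list of the most frequent character n-grams
          grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
          return [g for g, _ in grams.most_common(top)]

      def out_of_place(profile, ref):
          # Cavnar-Trenkle out-of-place distance between ranked n-gram lists
          rank = {g: i for i, g in enumerate(ref)}
          return sum(rank.get(g, len(ref)) for g in profile)

      def identify(text, refs):
          # refs: language tag -> reference profile, e.g.
          # {"hi": ngram_profile(hindi_sample), "bn": ngram_profile(bengali_sample)}
          p = ngram_profile(text)
          return min(refs, key=lambda lang: out_of_place(p, refs[lang]))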

  • Enabling Effective User Interactions

    • Enable querying in different languages

    • Assist in query formation, query translation and query selection

    • User feedback on result examination

    • Assistance in Browsing and Data Translation

  • Language Technology

    Enabling the Multilingual Digital Library

  • Normalization and spelling variation

    • Variant spellings of अंग्रेज़ी (“English”): अँगरेजी, अंगरेजी, अंग्रेजी, अंग्रेज़ी

    • Variant spellings of अंतरराष्ट्रीय (“international”): अंतरराष्ट्रीय, अन्तरराष्ट्रीय, अंतर्राष्ट्रीय, अन्तर्राष्ट्रीय

    • Bengali variants: পাখি, পাখী

    • Normalization maps such variants to a single indexing form (a sketch follows this slide)

    Stemming and Lemmatization

    • Stemming: map different morphological variants of a word to the same canonical form; adequate for monolingual search

    • Lemmatization: map words to their dictionary root form; needed for effective cross-lingual and multilingual search

    Tokenization, Script rendering

    Indexing, Query formation, retrieval, ranking
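    A minimal normalization sketch for the Devanagari variants shown earlier, assuming the common chandrabindu/anusvara fold; the rule set is illustrative, not an exhaustive standard.

      # Sketch: fold common Devanagari spelling variants to one indexing form.
      import unicodedata

      CHANDRABINDU, ANUSVARA = "\u0901", "\u0902"

      def normalize(token):
          t = unicodedata.normalize("NFC", token)  # compose nukta forms, matras
          t = t.replace(CHANDRABINDU, ANUSVARA)    # fold chandrabindu to anusvara
          return t

      # normalize("अँगरेजी") == normalize("अंगरेजी")  -> both index to one key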

  • Multilingual Digital Library: A Reality?

    • The Divide

    Resource-rich languages
    – Plenty of resources
    – Annotated resources

    Resource-poor languages
    – Little or no resources
    – Annotated resources are difficult to create

    • MT is not yet good enough to translate large texts and for specialized technical domains

    • Technology and NLP have made huge strides!

    • High-quality tools have been developed that push the accuracy of MT systems.

    • Tremendous growth in data and resources.

  • Neural Networks in NLP

    • Neural networks and deep learning are the state of the art for many NLP problems

    – Entity and Relation finding

    – Parsing

    – Machine Translation

    – Entity linking

    • Transfer learning from resource-rich to resource-poor languages. Enablers of cross-lingual transfer learning:

    – Shared representation

    – Bilingual and multilingual resources

  • What is required for building Multilingual Libraries

    Retrieval

    • Indexing objects in multiple languages

    • Enabling access through query in any language

    • Challenges: dictionary coverage, synonymy and polysemy, OOV words

    • Language independent indexing

    Access

    • Enable access in the user's language of familiarity

    – Machine Translation

  • Representation: Word Embedding

    • Continuous vector space representation

    • A dense vector x_i ∈ R^d is assigned to each word

    • The embeddings are learned so that similar words are nearby in the space

    • They capture regularities and relationships between words

    A key secret sauce for the success of many NLP systems across tasks

    Distributed Representations of Words and Phrases and their Compositionality. Mikolov et al., NIPS 2013.
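    A minimal sketch of similarity lookup in such a space; E is a |V| x d embedding matrix, and vocab/ivocab are hypothetical word-to-index and index-to-word maps.

      # Sketch: cosine-similarity nearest neighbours in an embedding matrix.
      import numpy as np

      def nearest(word, E, vocab, ivocab, k=5):
          v = E[vocab[word]]
          sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-9)
          best = np.argsort(-sims)[1:k + 1]        # skip the query word itself
          return [(ivocab[i], float(sims[i])) for i in best]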

  • Multilingual Embeddings

    [Figure: English and French words embedded in a shared space, with translation pairs lying close together: children/enfants, money/argent, law/loi, life/vie, world/monde, country/pays, war/guerre, peace/paix, energy/energie, market/marche]

    • Learn a shared embedding space between words in all languages.

    • Many benefits:

    – Transfer learning

    – Cross lingual information retrieval

    – MT

    – Parallel corpus extraction

    Mikolov et al., 2013; Faruqui & Dyer, 2014; and others

  • Multilingual Word Embeddings

    • Resources:

    – Word aligned data

    – Sentence aligned data

    – Dictionary

    – Document Aligned Corpora

    Methods:

    • Monolingual mapping

    – Train monolingual word embeddings and learn linear mapping or CCA

    • Pseudo-cross-lingual:

    – Train on pseudo-cross-lingual corpus by mixing contexts of different languages

    • Cross lingual training

    – Train on parallel corpus

    • Joint Optimization

    – jointly optimise a combination of monolingual and cross-lingual losses.

  • Learning from Parallel Corpora

    • Uses monolingual datasets to learn monolingual features

    • Uses a sampled bag-of-words from each parallel sentence as the cross-lingual objective

    • Learns representations that model the individual languages well

    • Encourages similar representations for words that are related across the two languages

  • Linear Projection

    Learn monolingual embeddings, then project them to a common space using a dictionary:

    min_W  sum_{i=1}^{n} || W h_i - e_i ||^2

    where (h_i, e_i) are dictionary translation pairs. The final embedding of h_i is W h_i.

    Mikolov et al., Exploiting Similarities among Languages for Machine Translation
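    As a minimal sketch of this projection under the stated objective, with H holding the source-side dictionary vectors as rows and E the target-side ones (names are illustrative):

      # Sketch: least-squares fit of W minimizing sum_i ||W h_i - e_i||^2.
      import numpy as np

      def learn_mapping(H, E):
          # H: (n, d1) source dictionary vectors as rows; E: (n, d2) targets.
          # Solves H W^T ≈ E, the row form of W h_i ≈ e_i.
          Wt, *_ = np.linalg.lstsq(H, E, rcond=None)
          return Wt.T

      # projecting a new source vector h: e_hat = learn_mapping(H, E) @ h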

  • Cross-lingual word vector projection using CCA

    • Take subsets Σ′ ⊂ Σ, Σ′ ∈ R^{n×d1} and Ω′ ⊂ Ω, Ω′ ∈ R^{n×d2} of the vectors of words that are dictionary translations of each other

    • Compute Canonical Correlation Analysis (CCA) on Σ′ and Ω′

    • CCA projects these vectors to a third space where they are maximally correlated
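    A minimal sketch using scikit-learn's CCA; X and Y are the dictionary-aligned source and target vector matrices (hypothetical variables):

      # Sketch: CCA over dictionary-aligned vector matrices (scikit-learn).
      from sklearn.cross_decomposition import CCA

      cca = CCA(n_components=100)   # dimensionality of the shared space
      # X: (n, d1) source-word vectors; Y: (n, d2) their translations' vectors
      # cca.fit(X, Y)
      # Xc, Yc = cca.transform(X, Y)  # maximally correlated projections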

  • Learning Bilingual Word Embeddings

    • Input:

    – e_SL (source language embeddings)

    – e_TL (target language embeddings)

    – D (seed dictionary)

    • Repeat:

    – W ← LearnMapping(e_SL, e_TL, D)

    – D ← LearnDictionary(e_SL, e_TL, D)

    • Until stopping criterion

    [Figure: the self-learning loop alternates between learning a mapping from the current dictionary and inducing a new dictionary from the mapped monolingual embeddings.]

    Learning bilingual word embeddings with (almost) no bilingual data. Mikel Artetxe, Gorka Labaka, Eneko Agirre. ACL 2017.
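    A minimal sketch of this self-learning loop, assuming row-normalized embedding matrices Es and Et and a small seed dictionary of index pairs; the mapping step reuses the least-squares fit shown earlier:

      # Sketch: alternate mapping learning and dictionary induction.
      import numpy as np

      def self_learning(Es, Et, seed_pairs, iters=10):
          pairs = list(seed_pairs)                  # (source_idx, target_idx)
          for _ in range(iters):
              H = Es[[s for s, _ in pairs]]
              E = Et[[t for _, t in pairs]]
              Wt, *_ = np.linalg.lstsq(H, E, rcond=None)   # mapping step
              mapped = Es @ Wt                             # all sources, mapped
              sims = mapped @ Et.T                         # cosine (rows normalized)
              pairs = [(s, int(np.argmax(sims[s]))) for s in range(len(Es))]
          return Wt.T, pairs                        # final mapping and dictionary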

  • Learning from Document-Aligned Corpora

    English-Hindi-Bengali Document Aligned articles from Wikipedia

    • Merge the aligned documents to get a “pseudo-trilingual document”

    • Remove the sentence boundaries

    • Randomly shuffle the pseudo-trilingual document

    • Intuition: each word w, regardless of its actual language, obtains word collocates from both vocabularies

    • Train word2vec (the skip-gram model) on this document (a sketch follows this slide)

    Vulic et al., SIGIR 2015
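    A minimal sketch of this pseudo-multilingual training, assuming gensim is available and `pairs` holds token lists of aligned documents:

      # Sketch: merge, shuffle, and train skip-gram on the shuffled text.
      import random
      from gensim.models import Word2Vec

      def pseudo_docs(pairs):                # [(en_tokens, hi_tokens), ...]
          for en, hi in pairs:
              merged = en + hi
              random.shuffle(merged)         # words of both languages mix contexts
              yield merged

      # model = Word2Vec(sentences=list(pseudo_docs(pairs)), sg=1, vector_size=300)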

  • Sample Results

  • How to incorporate embeddings for IR

    1. Expand the query using embeddings (followed by non-neural IR): add words similar to the query terms

    2. Use IR models that work directly in the embedding space: centroid distance, word mover's distance

    • Data:

    – Explicit relevance judgments

    – Implicit user behaviour, e.g. clickthrough data
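    A minimal sketch of option 1 above (embedding-based query expansion), reusing the `nearest` helper sketched earlier; the threshold is an illustrative guard against noisy neighbours:

      # Sketch: embedding-based query expansion before standard retrieval.
      def expand_query(terms, E, vocab, ivocab, k=3, threshold=0.6):
          expanded = list(terms)
          for t in terms:
              if t in vocab:
                  expanded += [w for w, s in nearest(t, E, vocab, ivocab, k)
                               if s >= threshold]   # keep only close neighbours
          return expanded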

  • Application to Multilingual IR

    Bhattacharya et al., CICLing 2016

    • For OOV words:

    – कैंसर - cancer, disease

    – स्पीकर - speaker, parliament

    • Example query translations:

    – भारतीय संसद में आतंकवादी हमला - Indian Parliament constitutional terrorist assault

    – आईफ़ोन आईपैड डिज़ाइन लोकप्रियता लांच - iPhone iPad popularity unveiled

    Representation-learning-based models can estimate relevance from semantic matches with the query

  • Multilingual Clustering using Word Embeddings

    • Construct a graph G = (V, E)

    • V: vertices, the words from all languages

    • E: edges, present if the similarity between two vertices is above a chosen threshold

    • Edges are weighted by cosine similarity

    [Figure: a similarity graph over Hindi and English words, e.g. पॉलिथीन, फेंकना, जूता, कचरों, सफाईकर्मी, झपटा, गमले alongside hurl, throw, dart, bags, wash, buckets, splash, toilet, with clusters forming around related senses.]

    • Use Louvain [Blondel et al., 2008] for cluster detection; it performs hard clustering and runs in O(n log n)
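    A minimal sketch of the graph construction and Louvain step, assuming networkx (3.x ships `louvain_communities`) and row-aligned `words`/`vectors`:

      # Sketch: similarity graph over words of all languages, then Louvain.
      import networkx as nx
      import numpy as np

      def word_graph(words, vectors, threshold=0.5):
          V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
          sims = V @ V.T                               # pairwise cosine similarity
          G = nx.Graph()
          G.add_nodes_from(words)
          for i in range(len(words)):
              for j in range(i + 1, len(words)):
                  if sims[i, j] > threshold:
                      G.add_edge(words[i], words[j], weight=float(sims[i, j]))
          return G

      # clusters = nx.community.louvain_communities(word_graph(words, vectors))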

  • Application to Cross-Language IR

    • Original query Q in the source language: q1 q2 q3 ... qn

    • Map each query term qi to its cluster Ck

    • Pick the target-language words from Ck

    • The query is thus translated into the target language (see the sketch after the figure below)

    [Figure: sample multilingual clusters, e.g. {flag, emblem, president, ध्वज, झंडा, वन्दे मातरम, পতাকা, প্রতীক}, {secretary, CEO, chairperson, चेयरमैन, सीईओ, अध्यक्ष}, and {currency, economics, inflation, prices, money, cost, आर्थिक, मुनाफा, মুদ্রাস্ফীতি, অর্থনীতি, টাকা, অর্থ, খরচ}.]
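    A minimal sketch of this cluster-based query translation; `clusters` are the word sets from the Louvain step and `is_target_word` is a hypothetical predicate (e.g. a script-range test):

      # Sketch: replace each source term with its cluster's target-language words.
      def translate_query(terms, clusters, is_target_word):
          out = []
          for t in terms:
              for c in clusters:
                  if t in c:
                      out += [w for w in c if is_target_word(w)]
                      break
          return out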

  • Results of the Clustering Approach

    • “Euro” in this query relates to sports, not economics.

    • The no-cluster method wrongly predicts the context and suggests words like “banknotes”.

    • Pairwise clustering, on the other hand, understands that “cup” relates to a sport, “football” more specifically.

    • Multilingual clustering restricts the translation to a shorter query, translating to only “trophy” and “cup”.

  • From Word Embedding to Query/Document Embedding

    • Bag of embedded words: sum or average of word vectors

    – Effective only for short text
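    A minimal sketch of the bag-of-embedded-words document vector (mean of word vectors; names are illustrative):

      # Sketch: document vector as the mean of its word vectors.
      import numpy as np

      def doc_embedding(tokens, E, vocab):
          vecs = [E[vocab[t]] for t in tokens if t in vocab]
          return np.mean(vecs, axis=0) if vecs else np.zeros(E.shape[1])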

  • DSSM: Deep Structured Semantic Model

    • or Deep Semantic Similarity Model.

    • Represent query q and document d in a continuous semantic space

    • Model the semantic similarity between q and d using cosine similarity.

    • Force the representations:

    – For relevant (q, d+) pairs, to be close in the latent space

    – For irrelevant (q, d−) pairs, to be far apart in the latent space

  • Deep Structured Semantic Model (DSSM) [Huang et al., 2013]
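    A minimal sketch of the DSSM training objective described above; the query/document encoders (the DSSM towers) are assumed to exist elsewhere, and gamma is the usual smoothing factor:

      # Sketch: DSSM-style loss with cosine similarity and sampled negatives.
      import torch
      import torch.nn.functional as F

      def dssm_loss(q_vec, d_pos, d_negs, gamma=10.0):
          # q_vec: (B, d); d_pos: (B, d); d_negs: (B, K, d)
          docs = torch.cat([d_pos.unsqueeze(1), d_negs], dim=1)        # (B, K+1, d)
          sims = F.cosine_similarity(q_vec.unsqueeze(1), docs, dim=-1)
          logits = gamma * sims                    # gamma: smoothing factor
          labels = torch.zeros(q_vec.size(0), dtype=torch.long)  # positive at 0
          return F.cross_entropy(logits, labels)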

  • Role of Knowledge Graphs in Information Retrieval

    • Knowledge graphs for semantic search: entities, attributes, types, relations, etc.

    • Popular (semi-)structured data sources: DBpedia, Freebase, YAGO, ConceptNet, and the knowledge graphs of Google and Microsoft

    • Use the graph structure for relevance computation

    Utilizing Knowledge Bases in Text-centric Information Retrieval. Tutorial by Laura Dietz, Alexander Kotov, and Edgar Meij. WSDM 2017.

  • Multilingual Knowledge Graph

    • Knowledge Graph connects words and phrases of Natural Language with labeled edges.

    • Knowledge Graph combined with Word Embedding

    • [Speer17] use ConceptNet to build semantic spaces that are more effective than distributional semantics alone.

    ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Robert Speer, Joshua Chin, Catherine Havasi. AAAI 2017.

  • Access in the Desired Language

    • Translation from language of object to desired language

    • Shortcomings of MT:

    – Great progress in sentence-level MT

    – Does not capture discourse-level translation

    – May not work on specific domain verticals not represented in the training set

    – Less-resourced language pairs are at a disadvantage

  • Neural Machine Translation

    • An encoder processes the source sentence and creates a compact representation

    • This representation is the input to the decoder which generates a sequence in the target language

    • Both encoder and decoder are RNNs

    [Figure: encoder-decoder model; the encoder maps the source sentence to a compact representation, which the decoder expands into the target sentence.]

    • End-to-end training: All parameters are simultaneously optimized

    • Distributed representations share strength

    • Better exploitation of context
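    A minimal GRU encoder-decoder sketch of the architecture described above; real NMT systems add attention (next slide), beam search, and subword vocabularies:

      # Sketch: GRU encoder-decoder; the encoder's final state conditions the decoder.
      import torch.nn as nn

      class Seq2Seq(nn.Module):
          def __init__(self, src_vocab, tgt_vocab, d=256):
              super().__init__()
              self.src_emb = nn.Embedding(src_vocab, d)
              self.tgt_emb = nn.Embedding(tgt_vocab, d)
              self.encoder = nn.GRU(d, d, batch_first=True)
              self.decoder = nn.GRU(d, d, batch_first=True)
              self.out = nn.Linear(d, tgt_vocab)

          def forward(self, src, tgt):
              _, h = self.encoder(self.src_emb(src))     # compact representation
              y, _ = self.decoder(self.tgt_emb(tgt), h)  # conditioned generation
              return self.out(y)                         # next-token logits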

  • [Figure: bidirectional encoder with attention. Each source word x_j yields forward and backward hidden states; the decoder state s_t and output y_t are computed from a weighted combination of the annotation vectors.]

    • Annotation vector h_j = [h_j(forward); h_j(backward)]

    • A set of annotation vectors {h_1, h_2, ..., h_T}

    • For each target word y_t:

    1. Compute alignment scores α_{t,j} = f(y_{t-1}, h_j, s_{t-1})

    2. Get a context vector c_t = Σ_j α_{t,j} h_j

    Attention Model for MT: Learning to Align and Translate Jointly
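    A minimal sketch of the attention step above (additive, Bahdanau-style); for brevity the previous output y_{t-1} is assumed to be folded into the decoder state, and W_s, W_h, v are learned parameters:

      # Sketch: additive attention; scores, softmax weights, weighted-sum context.
      import torch
      import torch.nn.functional as F

      def attention_context(s_prev, H, W_s, W_h, v):
          # s_prev: (d,) previous decoder state; H: (T, d) annotation vectors
          scores = torch.tanh(H @ W_h.T + W_s @ s_prev) @ v   # (T,) alignment scores
          alpha = F.softmax(scores, dim=0)                    # attention weights
          return alpha @ H                                    # context vector c_t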

  • NMT

  • Continuous Vector Space

    • Similar sentences are close in this space

    • Multiple dimensions of similarity encoded.

  • Multilingual Machine Translation

    • Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation; Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, Jeffrey Dean; Transactions of the Association for Computational Linguistics, Volume 5, 2017

    • Multi-Way, Multilingual Neural Machine Translation; Orhan Firat, Kyunghyun Cho, Baskaran Sankaran, Fatos T. Yarman Vural, Yoshua Bengio; Computer Speech and Language, Volume 45, September 2017

  • GNMT

  • Utilizing more data sources

    • Multi-lingual: learn from many language pairs?

    • SMT-inspired: utilize monolingual data?

    • Multi-task: combine seq2seq tasks?

  • Multilingual Translation

    • Number of parameters grows linearly w.r.t. number of languages

    • Multi-source translation

  • Multilingual Translation with Shared Alignment

    • One encoder per source language

    • One decoder per target language

    • Shared attention mechanism: target hidden state and source context vector determine the attention weight

    • No multi-way parallel corpus assumed:

    – Bilingual sentence pairs only

    – Each sentence pair activates/updates one encoder, one decoder, and the shared attention

  • Zero-Shot Translation with Google’s Multilingual Neural Machine Translation System

    • Google Neural Machine Translation (GNMT), an end-to-end learning framework that learns from millions of examples – Sep 2016

    • GNMT was extended to allow a single system to translate between multiple languages.

    • Uses an additional “token” at the beginning of the input sentence to specify the required target language to translate into.

    • In addition to improving translation quality, the method also enables “zero-shot translation”: translation between language pairs never seen explicitly by the system.
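    A minimal sketch of the token trick, following the paper's `<2xx>` convention; `translate` stands in for the trained multilingual model:

      # Sketch: prepend the target-language token; one model serves all pairs.
      def to_multilingual_input(tokens, target_lang):
          return [f"<2{target_lang}>"] + tokens   # e.g. "<2hi>" requests Hindi output

      # translate(to_multilingual_input("How are you ?".split(), "bn"))  # -> Bengali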

  • Zero-shot

  • One model to learn them all

    • Multi-modal, multi-task: Text, speech, image… all converging to a common paradigm.

    • If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...

    • Or you may train them together to achieve zero-shot AI.

    – Translation without any direct parallel resource

  • Multimodal

  • Multiple Encoder/Decoder Framework

    • Use several encoders and decoders for:

    – different language pairs

    – other seq2seq tasks (speech)

    – sentence classification tasks (sequence-to-category)

    – image captioning (image-to-sequence)

    • Force the representation to be identical across all encoders

  • Multilingual systems currently serve 10 of the 16 recently launched language pairs in Google Translate

  • Applications of Multilingual Embeddings: Mine for Parallel Data

    • Extract parallel data from large monolingual collections

    • Approach

    – Embed billions of sentences in same space

    – For each sentence in one language, search the k-closest ones in another language

    – Decide which sentences are possible translations based on distance: simple threshold, classifier

    • NMT Marathon project

    – Multilingual embeddings for sentence-level parallel corpora

    – Open and highly scalable implementation customized to retrieve nearby sentences

    – Target: 17PB of WEB crawl data (Internet Archive)

    – Start with 55 TB of CommonCrawl
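    A minimal sketch of the mining step described above using FAISS (embeddings as float32 numpy arrays); the plain threshold is a simplification of the distance-based filtering:

      # Sketch: index target sentence embeddings, search neighbours, threshold.
      import faiss   # embeddings must be float32 numpy arrays

      def mine_pairs(src_emb, tgt_emb, k=4, threshold=0.8):
          faiss.normalize_L2(src_emb)
          faiss.normalize_L2(tgt_emb)
          index = faiss.IndexFlatIP(tgt_emb.shape[1])   # inner product = cosine here
          index.add(tgt_emb)
          sims, ids = index.search(src_emb, k)
          return [(i, int(ids[i, 0])) for i in range(len(src_emb))
                  if sims[i, 0] >= threshold]           # keep confident pairs only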

  • The Way Forward

    • Multilingual Digital Libraries are required to enable Universal Access

    • Need for high-quality, high-coverage, robust Language Technologies: translation, text mining, interfaces for Indian languages

    • Scarcity of resources for many languages.

    • CLIR/MLIA performance depends on the availability of high-quality translation resources and language processing tools

    • Finding ways to acquire, maintain, and update language tools and resources is a necessity

  • The Way Forward

    • Creation of Multilingual knowledge bases and knowledge graphs

    • Semantic Web, ontologies, linked data, interoperability

  • A Note of Caution

    • Managing Expectations for Automatic Processing

  • Thank You !!