good template

Upload: neha-khan

Post on 06-Apr-2018

223 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/2/2019 Good Template

    1/58

    1

    Cross Language Information

    Retrieval (CLIR)

    Modern Information Retrieval

    Sharif University of Technology

    Fall 2005

    Mohsen Jamali

  • 8/2/2019 Good Template

    2/58

    2

    The General Problem

    Find documents written in any language

    Using queries expressed in a single language

  • 8/2/2019 Good Template

    3/58

    3

    The General Problem (cont)

    Traditional IR identifies relevant documents in thesame language as the query (monolingual IR)

    Cross-language information retrieval (CLIR) tries

    to identify relevant documents in a languagedifferent from that of the query

    This problem is more and more acute for IR on the

    Web due to the fact that the Web is a trulymultilingual environment

  • 8/2/2019 Good Template

    4/58

    4

    Why is CLIR important?

  • 8/2/2019 Good Template

    5/58

    5

    Characteristics of the WWW

    Country of Origin of Public Web Sites, 2001 (% of Total)(OCLC Web Characterization, 2001)

  • 8/2/2019 Good Template

    6/58

    6Source: Global Reach

    5%

    9%

    6%

    5%

    5%

    4%

    4%

    3%

    2%

    2%

    3%

    52%

    8%

    8%

    5%

    3%

    21%

    2%

    5%2%

    5%

    3%

    6%

    32%

    Spanish Japanese German French

    Chinese Scandanavian Italian Dutch

    Korean Portuguese Other English

    EnglishEnglish

    2000 2005

    Global Internet User Population

    Chinese

  • 8/2/2019 Good Template

    7/58

    7

    Importance of CLIR

    CLIR research is becoming more and moreimportant for global information exchange andknowledge sharing.

    National Security

    Foreign Patent Information Access

    Medical Information Access for Patients

  • 8/2/2019 Good Template

    8/58

    8

    CLIR is Multidisciplinary

    CLIR involves researchers from thefollowing fields:

    information retrieval, natural languageprocessing, machine translation andsummarization, speech processing,document image understanding,

    human-computer interaction

  • 8/2/2019 Good Template

    9/58

    9

    User Needs

    Search a monolingual collection in a language thatthe user cannot read.

    Retrieve information from a multilingual collection

    using a query in a single language.

    Select images from a collection indexed with freetext captions in an unfamiliar language.

    Locate documents in a multilingual collection ofscanned page images.

  • 8/2/2019 Good Template

    10/58

    10

    Why Do Cross-Language IR?

    When users can read several languages

    Eliminates multiple queries

    Query in most fluent language

    Monolingual users can also benefit

    If translations can be provided

    If it suffices to know that a document exists

    If text captions are used to search for images

  • 8/2/2019 Good Template

    11/58

    11

    Language Identification

    Can be specified using metadata

    Included in HTTP and HTML

    Determined using word-scale features

    Which dictionary gets the most hits?

    Determined using subword features

    Letter n-grams in electronic and printed text

    Phoneme n-grams in speech

  • 8/2/2019 Good Template

    12/58

    12

    Language Encoding Standards

    Language (alphabet) specific native encoding:

    Chinese GB, Big5,

    Western European ISO-8859-1 (Latin1)

    Russian KOI-8, ISO-8859-5, CP-1251

    UNICODE (ISO/IEC 10646)

    UTF-8 variable-byte length UTF-16, UCS-2 fixed double-byte

  • 8/2/2019 Good Template

    13/58

    13

    CLIR Experimental System

    2 systems: SMART Information retrieval system modified to work with 11

    European languages (Danish, Dutch, English, Finnish, French,German, Italian, Norwegian, Portuguese, Spanish, Swedish)

    Generation of restricted bigrams

    Pseudo-Relevance feedback TAPIR is a language model IR system written by M. Srikanth. It

    has been adated to work with 12 different European languages(Danish, Dutch, English, Finnish, French, German, Italian,Norwegian, Portuguese, Russian, Spanish, Swedish)

    Stemming using Porters stemmer Translation using Intertran(http://www.tranexp.com:2000/InterTran)

    http://www.tranexp.com:2000/InterTranhttp://www.tranexp.com:2000/InterTran
  • 8/2/2019 Good Template

    14/58

    14

    Approaches to CLIR

  • 8/2/2019 Good Template

    15/58

    15

    Design Decisions

    What to index?

    Free text or controlled vocabulary

    What to translate?

    Queries or documents

    Where to get translation knowledge?

    Dictionary, ontology, training corpus

  • 8/2/2019 Good Template

    16/58

    16

    Term-aligned Sentence-aligned Document-aligned Unaligned

    Parallel Comparable

    Knowledge-based Corpus-based

    Controlled Vocabulary Free Text

    Cross-Language Text Retrieval

    Query Translation Document Translation

    Text Translation Vector Translation

    Ontology-based Dictionary-based

    Thesaurus-based

  • 8/2/2019 Good Template

    17/58

  • 8/2/2019 Good Template

    18/58

    18

    Controlled Vocabulary Matures

    1977 IBM STAIRS-TLS

    Large-scale commercial cross-language IR

    1978 ISO Standard 5964

    Guidelines for developing multilingual thesauri

    1984 EUROVOC thesaurus

    Now includes all 9 EC languages

    1985 ISO Standard 5964 revised

  • 8/2/2019 Good Template

    19/58

    19

    Free Text Developments

    1970, 1973 Salton

    Hand coded bilingual term lists

    1990 Latent Semantic Indexing

    1994 European multilingual IR project

    First precision/recall evaluation

    1996 SIGIR Cross-lingual IR workshop

    1998 EU/NSF digital library working group

  • 8/2/2019 Good Template

    20/58

    20

    Query vs. Document Translation

    Query translation

    Very efficient for short queries

    Not as big an advantage for relevance feedback

    Hard to resolve ambiguous query terms

    Document translation

    May be needed by the selection interface

    And supports adaptive filtering well

    Slow, but only need to do it once per document

    Poor scale-up to large numbers of languages

  • 8/2/2019 Good Template

    21/58

    21

    Document Translation Example

    Approach

    Select a single query language

    Translate every document into that language Perform monolingual retrieval

    Long documents provide enough context

    And many translation errors do not hurt retrieval

    Much of the generation effort is wasted

    And choosing a single translation can hurt

  • 8/2/2019 Good Template

    22/58

    22

    Text Translation

    One weakness of present fully automatic machinetranslation systems is that they are able to produce highquality translations only in limited domains

    Text retrieval systems are typically more tolerant ofsyntactic than semantic translation errors but thatsemantic accuracy suffers when insufficient domainknowledge is encoded into a translation system

    In fact some of the work done by a machine translationsystem could actually reduce some measures of retrieval

    effectiveness

  • 8/2/2019 Good Template

    23/58

    23

    Query Translation Example

    Select controlled vocabulary search terms

    Retrieve documents in desired language

    Form monolingual query from the documents

    Perform a monolingual free text search

    Information

    NeedThesaurus

    Controlled

    Vocabulary

    Multilingual

    Text Retrieval

    System

    Alta Vista

    FrenchQuery

    Terms

    English

    Abstracts

    EnglishWeb

    Pages

  • 8/2/2019 Good Template

    24/58

    24

    Query Translation

    Queries (E) Query

    Translation

    ChineseIR System

    MTSystem

    An English-Chinese CLIR System

    Queries (C)

    Results (C)Results (E)

    ChineseDocuments

  • 8/2/2019 Good Template

    25/58

    25

    Controlled Vocabulary

    A controlled vocabulary information retrieval system canbe very useful in the hands of a skilled searcher, but endusers often find free text searching to be more helpful.

    Experience has shown that although the domain

    knowledge that can be encoded in a thesaurus permitsexperienced users to form more precise queries casualand intermittent users have diffculty exploiting theexpressive power of a traditional query interface in exactmatch retrieval systems

    Controlled vocabulary text retrieval systems are widelyused in libraries and user needs assessment has receivedconsiderable attention from library and information scienceresearchers.

  • 8/2/2019 Good Template

    26/58

    26

    Knowledge-based Techniquesfor Free Text Searching

  • 8/2/2019 Good Template

    27/58

    27

    Knowledge Structures for IR

    Ontology Representation of concepts and relationships

    Thesaurus

    Ontology specialized for retrieval Bilingual lexicon

    Ontology specialized for machine translation

    Bilingual dictionary Ontology specialized for human translation

  • 8/2/2019 Good Template

    28/58

    28

    Machine Readable Dictionaries

    Based on printed bilingual dictionaries

    Becoming widely available

    Used to produce bilingual term lists Cross-language term mappings are accessible

    Sometimes listed in order of most common usage

    Some knowledge structure is also present

    Hard to extract and represent automatically

    The challenge is to pick the right translation

  • 8/2/2019 Good Template

    29/58

    29

    CLIR: Dictionary Based

    Problems

    Limitations of dictionaries

    Inflected word forms

    Phrases and compound words

    Lexical ambiguity

    Possible solution

    Approximate string matching

    Unconstrained Query

  • 8/2/2019 Good Template

    30/58

    30

    Unconstrained QueryTranslation

    Replace each word with every translation

    Typically 5-10 translations per word

    About 50% of monolingual effectiveness Ambiguity is a serious problem

    Example: Fly (English)

    8 word senses (e.g., to fly a flag)

    13 Spanish translations (enarbolar, ondear, )

    38 English retranslations (hoist, brandish, lift)

  • 8/2/2019 Good Template

    31/58

    31

    Linguistic

    Analysisand

    LanguageSupport

    MachineTranslation

    MultilingualSearch Engine

    User

    Interface

    French

    Query

    Spanish

    QueryChineseQuery

    EnglishQuery

    SpanishDatabase

    French

    Database

    Chinese

    Database

    English

    Database

    Merging

    Results

    EnglishUser

    Query

    Expanded

    Query

    List ofResults

    Requests

    for DocumentTranslation

  • 8/2/2019 Good Template

    32/58

    32

    Exploiting Part-of-Speech Tags

    Constrain translations by part of speech

    Noun, verb, adjective,

    Effective taggers are available

    Works well when queries are full sentences

    Short queries provide little basis for tagging

    Constrained matching can hurt monolingual IR

    Nouns in queries often match verbs in documents

  • 8/2/2019 Good Template

    33/58

    33

    Phrase Indexing

    Improves retrieval effectiveness two ways

    Phrases are less ambiguous than single words

    Idiomatic phrases translate as a single concept Three ways to identify phrases

    Semantic (e.g., appears in a dictionary)

    Syntactic (e.g., parse as a noun phrase)

    Cooccurrence (words found together often)

    Semantic phrase results are impressive

  • 8/2/2019 Good Template

    34/58

    34

    Corpus-based Techniquesfor Free Text Searching

  • 8/2/2019 Good Template

    35/58

    35

    Types of Bilingual Corpora

    Parallel corpora: translation-equivalent pairs

    Document pairs

    Sentence pairs

    Term pairs

    Comparable corpora

    Content-equivalent document pairs

    Unaligned corpora Content from the same domain

  • 8/2/2019 Good Template

    36/58

    36

    Pseudo-Relevance Feedback

    Enter query terms in French

    Find top French documents in parallel corpus

    Construct a query from English translations

    Perform a monolingual free text search

    Top ranked

    FrenchDocumentsFrench

    Text

    Retrieval

    System

    Alta Vista

    French

    QueryTerms

    EnglishTranslations

    English

    WebPages

    Parallel

    Corpus

  • 8/2/2019 Good Template

    37/58

    37

    Learning From Document Pairs

    Count how often each term occurs in each pair

    Treat each pair as a single document

    E1 E2 E3 E4 E5 S1 S2 S3 S4

    Doc 1

    Doc 2

    Doc 3

    Doc 4

    Doc 5

    4 2 2 1

    8 4 4 2

    2 2 2 1

    2 1 2 1

    4 1 2 1

    English Terms Spanish Terms

  • 8/2/2019 Good Template

    38/58

    38

    Similarity-Based Dictionaries

    Automatically developed from aligned documents

    Terms E1 and E3 are used in similar ways

    Terms E1 & S1 (or E3 & S4) are even more similar For each term, find most similar in other language

    Retain only the top few (5 or so)

    Performs as well as dictionary-based techniques

    Evaluated on a comparable corpus of news stories

    Stories were automatically linked based on date and subject

  • 8/2/2019 Good Template

    39/58

    39

    Generalized Vector Space Model

    Term space of each language is different

    But the document space for a corpus is the same

    Describe new documents based on the corpus Vector of cosine similarity to each corpus document

    Easily generated from a vector of term weights

    Multiply by the term-document matrix

    Compute cosine similarity in document space

    Excellent results when the domain is the same

  • 8/2/2019 Good Template

    40/58

    40

    Latent Semantic Indexing

    Designed for better monolingual effectiveness

    Works well across languages too

    Cross-language is just a type of term choice variation Produces short dense document vectors

    Better than long sparse ones for adaptive filtering

    Training data needs grow with dimensionality

    Not as good for retrieval efficiency

    Always 300 multiplications, even for short queries

  • 8/2/2019 Good Template

    41/58

    41

    Sentence-Aligned Parallel Corpora

    Easily constructed from aligned documents

    Match pattern of relative sentence lengths

    Not yet used directly for effective retrieval But all experiments have included domain shift

    Good first step for term alignment

    Sentences define a natural context

  • 8/2/2019 Good Template

    42/58

    42

    Cooccurrence-Based Translation

    Align terms using cooccurrence statistics

    How often do a term pair occur in sentence pairs?

    Weighted by relative position in the sentences Retain term pairs that occur unusually often

    Useful for query translation

    Excellent results when the domain is the same

    Also practical for document translation

    Term usage reinforces good translations

  • 8/2/2019 Good Template

    43/58

    43

    Exploiting Unaligned Corpora

    Documents about the same set of subjects

    No known relationship between document pairs

    Easily available in many applications Two approaches

    Use a dictionary for rough translation

    But refine it using the unaligned bilingual corpus

    Use a dictionary to find alignments in the corpus

    Then extract translation knowledge from the alignments

  • 8/2/2019 Good Template

    44/58

    44

    Feedback with Unaligned Corpora

    Pseudo-relevance feedback is fully automatic

    Augment the query with top ranked documents

    Improves recall Recenters queries based on the corpus

    Short queries get the most dramatic improvement

    Two opportunities:

    Query language: Improve the query

    Document language: Suppress translation error

  • 8/2/2019 Good Template

    45/58

    45

    Context Linking

    Automatically align portions of documents

    For each query term:

    Find translation pairs in corpus using dictionary Select a context of nearby terms

    e.g., +/- 5 words in each language

    Choose translations from most similar contexts

    Based on cooccurrence with other translation pairs No reported experimental results

  • 8/2/2019 Good Template

    46/58

    46

    Problems with CLIR

    Morphological processing difficult for somelanguages (e.g. Arabic) Many different encodings for Arabic

    Windows Arabic (e.g. dictionaries)

    Unicode (UTF-8) (e.g. corpus)

    Macintosh Arabic (e.g. queries)

    Normalization Remove diacritics

    to Arabic (language)

    Standardize spellings for foreign names vs Kleentoon vs Klntoon for Clinton

  • 8/2/2019 Good Template

    47/58

    47

    Problems with CLIR (contd)

    Morphological processing (contd.) Arabic stemming

    Root + patterns+suffixes+prefixes=wordktb+CiCaC=kitabAll verbs and nouns derived from fewer than 2000 roots

    Roots too abstract for information retrievalktb kitaba bookkitabi my bookalkitabthe bookkitabuki your book (f)kataba to writekitabuka your book (m)maktabofficekitabuhu his book

    maktaba library, bookstore ...Want stem=root+pattern+derivational affixes?No standard stemmers available,only morphological (root) analyzers

  • 8/2/2019 Good Template

    48/58

    CLIR b h IR?

  • 8/2/2019 Good Template

    49/58

    49

    CLIR better than IR?

    How can cross-language beat within-language? We knowthere are translation errors Surely those errors should hurtperformance

    Hypothesis is that translation process may disambiguate

    some query terms Words that are ambiguous in Arabic may not be ambiguous inEnglish

    Expansion during translation from English to Arabic prevents theambiguity from re-appearing

    Has been proposed that CLIR is a model for IR Translate query into one language and then back to original Given hypothesis, should have an improved query Should be reasonable to do this across many different languages

    L D i L

  • 8/2/2019 Good Template

    50/58

    50

    Low Density Languages

    Languages for which few on-line resources exist

    Rumor has it that 25 languages are well represented on Web

    Extreme is kitchen languages that are only spoken

    More extreme: a language made up of whistling

    Corpus to be searched may also be very small

    Bilingual dictionaries often exist in print, may need to useinterlingua such as French

    Some approaches, such as those relying on translationprobabilities may not work well

    Solution depends on specific application

  • 8/2/2019 Good Template

    51/58

    51

    Performance Evaluation

    C t ti T t C ll ti

  • 8/2/2019 Good Template

    52/58

    52

    Constructing Test Collections

    One collection for retrospective retrieval

    Start with a monolingual test collection

    Documents, queries, relevance judgments Translate the queries by hand

    Need 2 collections for adaptive filtering

    Monolingual test collection in one language

    Plus a document collection in the other language

    Generate relevance judgments for the same queries

    E l i C B d T h i

  • 8/2/2019 Good Template

    53/58

    53

    Evaluating Corpus-Based Techniques

    Same domain evaluation

    Partition a bilingual corpus

    Design queries Generate relevance judgments for evaluation part

    Cross-domain evaluation

    Can use existing collections and corpora

    No good metric for degree of domain shift

    E l ti E l

  • 8/2/2019 Good Template

    54/58

    54

    Evaluation Example

    Corpus-based samedomain evaluation

    Use average precision as

    figure of merit

    Technique Cross-lang Mono-lingual Ratio

    Cooccurrence-based dictionary 0.43 0.47 91%

    Pseudo-relevance feedback 0.40 0.44 90%

    Generalized vector space model 0.38 0.40 95%

    Latent semantic indexing 0.31 0.37 84%

    Dictionary-based translation 0.29 0.47 61%

  • 8/2/2019 Good Template

    55/58

    55

    User Interface Design

    Q F l ti

  • 8/2/2019 Good Template

    56/58

    56

    Query Formulation

    Interactive word sense disambiguation

    Show users the translated query

    Retranslate it for monolingual users

    Provide an easy way of adjusting it

    But dont require that users adjust or approve it

    S l ti d E i ti

  • 8/2/2019 Good Template

    57/58

    57

    Selection and Examination

    Document selection is a decision process

    Relevance feedback, problem refinement, read it

    Based on factors not used by the retrieval system Provide information to support that decision

    May not require very good translations

    e.g., Word-by-word title translation

    People can read past some ambiguity

    May help to display a few alternative translations

    R f

  • 8/2/2019 Good Template

    58/58

    58

    References

    Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). Powerpoint presentation, University of Buffalo. 2002

    Douglas W Oard, Bonnie J Dorr. A Survey of Multilingual TextRetrieval .1996

    Jian-Yun Nie: Cross-Language Information Retrieval. IEEE

    Computational Intelligence Bulletin 2(1): 19-24 (2003) Hansen, Preben and Petrelli, Daniela and Karlgren, Jussi andBeaulieu, Micheline and Sanderson, Mark (2002) User-CenteredInterface Design for Cross-Language Information Retrieval. In:Proceedings of the Twenty-fifth Annual International ACM SIGIRConference on Research and Development in Information Retrieval,

    Tampere, Finland. 2002 Elizabeth D. Liddy and Anne R. Diekema. Cross-Language

    Information Exploitation of Arabic. Power point presentationApril 2005