good template

8/2/2019 Good Template

1/58

1

Cross Language Information

Retrieval (CLIR)

Modern Information Retrieval

Sharif University of Technology

Fall 2005

Mohsen Jamali


2/58

2

The General Problem

Find documents written in any language

Using queries expressed in a single language


3/58

3

The General Problem (cont)

Traditional IR identifies relevant documents in thesame language as the query (monolingual IR)

Cross-language information retrieval (CLIR) tries

to identify relevant documents in a languagedifferent from that of the query

This problem is more and more acute for IR on the

Web due to the fact that the Web is a trulymultilingual environment


4/58

4

Why is CLIR important?


5/58

5

Characteristics of the WWW

Country of Origin of Public Web Sites, 2001 (% of Total)(OCLC Web Characterization, 2001)


6/58

6Source: Global Reach

5%

9%

6%

5%

5%

4%

4%

3%

2%

2%

3%

52%

8%

8%

5%

3%

21%

2%

5%2%

5%

3%

6%

32%

Spanish Japanese German French

Chinese Scandanavian Italian Dutch

Korean Portuguese Other English

EnglishEnglish

2000 2005

Global Internet User Population

Chinese


7/58

7

Importance of CLIR

CLIR research is becoming more and moreimportant for global information exchange andknowledge sharing.

National Security

Foreign Patent Information Access

Medical Information Access for Patients


8/58

8

CLIR is Multidisciplinary

CLIR involves researchers from thefollowing fields:

information retrieval, natural languageprocessing, machine translation andsummarization, speech processing,document image understanding,

human-computer interaction


9/58

9

User Needs

Search a monolingual collection in a language thatthe user cannot read.

Retrieve information from a multilingual collection

using a query in a single language.

Select images from a collection indexed with freetext captions in an unfamiliar language.

Locate documents in a multilingual collection ofscanned page images.


10/58

10

Why Do Cross-Language IR?

When users can read several languages

Eliminates multiple queries

Query in most fluent language

Monolingual users can also benefit

If translations can be provided

If it suffices to know that a document exists

If text captions are used to search for images


11/58

11

Language Identification

Can be specified using metadata

Included in HTTP and HTML

Determined using word-scale features

Which dictionary gets the most hits?

Determined using subword features

Letter n-grams in electronic and printed text

Phoneme n-grams in speech


12/58

12

Language Encoding Standards

Language (alphabet) specific native encoding:

Chinese GB, Big5,

Western European ISO-8859-1 (Latin1)

Russian KOI-8, ISO-8859-5, CP-1251

UNICODE (ISO/IEC 10646)

UTF-8 variable-byte length UTF-16, UCS-2 fixed double-byte


13/58

13

CLIR Experimental System

2 systems: SMART Information retrieval system modified to work with 11

European languages (Danish, Dutch, English, Finnish, French,German, Italian, Norwegian, Portuguese, Spanish, Swedish)

Generation of restricted bigrams

Pseudo-Relevance feedback TAPIR is a language model IR system written by M. Srikanth. It

has been adated to work with 12 different European languages(Danish, Dutch, English, Finnish, French, German, Italian,Norwegian, Portuguese, Russian, Spanish, Swedish)

Stemming using Porters stemmer Translation using Intertran(http://www.tranexp.com:2000/InterTran)
http://www.tranexp.com:2000/InterTranhttp://www.tranexp.com:2000/InterTran


14/58

14

Approaches to CLIR


15/58

15

Design Decisions

What to index?

Free text or controlled vocabulary

What to translate?

Queries or documents

Where to get translation knowledge?

Dictionary, ontology, training corpus


16/58

16

Term-aligned Sentence-aligned Document-aligned Unaligned

Parallel Comparable

Knowledge-based Corpus-based

Controlled Vocabulary Free Text

Cross-Language Text Retrieval

Query Translation Document Translation

Text Translation Vector Translation

Ontology-based Dictionary-based

Thesaurus-based


17/58


18/58

18

Controlled Vocabulary Matures

1977 IBM STAIRS-TLS

Large-scale commercial cross-language IR

1978 ISO Standard 5964

Guidelines for developing multilingual thesauri

1984 EUROVOC thesaurus

Now includes all 9 EC languages

1985 ISO Standard 5964 revised


19/58

19

Free Text Developments

1970, 1973 Salton

Hand coded bilingual term lists

1990 Latent Semantic Indexing

1994 European multilingual IR project

First precision/recall evaluation

1996 SIGIR Cross-lingual IR workshop

1998 EU/NSF digital library working group


20/58

20

Query vs. Document Translation

Query translation

Very efficient for short queries

Not as big an advantage for relevance feedback

Hard to resolve ambiguous query terms

Document translation

May be needed by the selection interface

And supports adaptive filtering well

Slow, but only need to do it once per document

Poor scale-up to large numbers of languages


21/58

21

Document Translation Example

Approach

Select a single query language

Translate every document into that language Perform monolingual retrieval

Long documents provide enough context

And many translation errors do not hurt retrieval

Much of the generation effort is wasted

And choosing a single translation can hurt


22/58

22

Text Translation

One weakness of present fully automatic machinetranslation systems is that they are able to produce highquality translations only in limited domains

Text retrieval systems are typically more tolerant ofsyntactic than semantic translation errors but thatsemantic accuracy suffers when insufficient domainknowledge is encoded into a translation system

In fact some of the work done by a machine translationsystem could actually reduce some measures of retrieval

effectiveness


23/58

23

Query Translation Example

Select controlled vocabulary search terms

Retrieve documents in desired language

Form monolingual query from the documents

Perform a monolingual free text search

Information

NeedThesaurus

Controlled

Vocabulary

Multilingual

Text Retrieval

System

Alta Vista

FrenchQuery

Terms

English

Abstracts

EnglishWeb

Pages


24/58

24

Query Translation

Queries (E) Query

Translation

ChineseIR System

MTSystem

An English-Chinese CLIR System

Queries (C)

Results (C)Results (E)

ChineseDocuments


25/58

25

Controlled Vocabulary

A controlled vocabulary information retrieval system canbe very useful in the hands of a skilled searcher, but endusers often find free text searching to be more helpful.

Experience has shown that although the domain

knowledge that can be encoded in a thesaurus permitsexperienced users to form more precise queries casualand intermittent users have diffculty exploiting theexpressive power of a traditional query interface in exactmatch retrieval systems

Controlled vocabulary text retrieval systems are widelyused in libraries and user needs assessment has receivedconsiderable attention from library and information scienceresearchers.


26/58

26

Knowledge-based Techniquesfor Free Text Searching


27/58

27

Knowledge Structures for IR

Ontology Representation of concepts and relationships

Thesaurus

Ontology specialized for retrieval Bilingual lexicon

Ontology specialized for machine translation

Bilingual dictionary Ontology specialized for human translation


28/58

28

Machine Readable Dictionaries

Based on printed bilingual dictionaries

Becoming widely available

Used to produce bilingual term lists Cross-language term mappings are accessible

Sometimes listed in order of most common usage

Some knowledge structure is also present

Hard to extract and represent automatically

The challenge is to pick the right translation


29/58

29

CLIR: Dictionary Based

Problems

Limitations of dictionaries

Inflected word forms

Phrases and compound words

Lexical ambiguity

Possible solution

Approximate string matching

Unconstrained Query


30/58

30

Unconstrained QueryTranslation

Replace each word with every translation

Typically 5-10 translations per word

About 50% of monolingual effectiveness Ambiguity is a serious problem

Example: Fly (English)

8 word senses (e.g., to fly a flag)

13 Spanish translations (enarbolar, ondear, )

38 English retranslations (hoist, brandish, lift)


31/58

31

Linguistic

Analysisand

LanguageSupport

MachineTranslation

MultilingualSearch Engine

User

Interface

French

Query

Spanish

QueryChineseQuery

EnglishQuery

SpanishDatabase

French

Database

Chinese

Database

English

Database

Merging

Results

EnglishUser

Query

Expanded

Query

List ofResults

Requests

for DocumentTranslation


32/58

32

Exploiting Part-of-Speech Tags

Constrain translations by part of speech

Noun, verb, adjective,

Effective taggers are available

Works well when queries are full sentences

Short queries provide little basis for tagging

Constrained matching can hurt monolingual IR

Nouns in queries often match verbs in documents


33/58

33

Phrase Indexing

Improves retrieval effectiveness two ways

Phrases are less ambiguous than single words

Idiomatic phrases translate as a single concept Three ways to identify phrases

Semantic (e.g., appears in a dictionary)

Syntactic (e.g., parse as a noun phrase)

Cooccurrence (words found together often)

Semantic phrase results are impressive


34/58

34

Corpus-based Techniquesfor Free Text Searching


35/58

35

Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs

Document pairs

Sentence pairs

Term pairs

Comparable corpora

Content-equivalent document pairs

Unaligned corpora Content from the same domain


36/58

36

Pseudo-Relevance Feedback

Enter query terms in French

Find top French documents in parallel corpus

Construct a query from English translations

Perform a monolingual free text search

Top ranked

FrenchDocumentsFrench

Text

Retrieval

System

Alta Vista

French

QueryTerms

EnglishTranslations

English

WebPages

Parallel

Corpus


37/58

37

Learning From Document Pairs

Count how often each term occurs in each pair

Treat each pair as a single document

E1 E2 E3 E4 E5 S1 S2 S3 S4

Doc 1

Doc 2

Doc 3

Doc 4

Doc 5

4 2 2 1

8 4 4 2

2 2 2 1

2 1 2 1

4 1 2 1

English Terms Spanish Terms


38/58

38

Similarity-Based Dictionaries

Automatically developed from aligned documents

Terms E1 and E3 are used in similar ways

Terms E1 & S1 (or E3 & S4) are even more similar For each term, find most similar in other language

Retain only the top few (5 or so)

Performs as well as dictionary-based techniques

Evaluated on a comparable corpus of news stories

Stories were automatically linked based on date and subject


39/58

39

Generalized Vector Space Model

Term space of each language is different

But the document space for a corpus is the same

Describe new documents based on the corpus Vector of cosine similarity to each corpus document

Easily generated from a vector of term weights

Multiply by the term-document matrix

Compute cosine similarity in document space

Excellent results when the domain is the same


40/58

40

Latent Semantic Indexing

Designed for better monolingual effectiveness

Works well across languages too

Cross-language is just a type of term choice variation Produces short dense document vectors

Better than long sparse ones for adaptive filtering

Training data needs grow with dimensionality

Not as good for retrieval efficiency

Always 300 multiplications, even for short queries


41/58

41

Sentence-Aligned Parallel Corpora

Easily constructed from aligned documents

Match pattern of relative sentence lengths

Not yet used directly for effective retrieval But all experiments have included domain shift

Good first step for term alignment

Sentences define a natural context


42/58

42

Cooccurrence-Based Translation

Align terms using cooccurrence statistics

How often do a term pair occur in sentence pairs?

Weighted by relative position in the sentences Retain term pairs that occur unusually often

Useful for query translation

Excellent results when the domain is the same

Also practical for document translation

Term usage reinforces good translations


43/58

43

Exploiting Unaligned Corpora

Documents about the same set of subjects

No known relationship between document pairs

Easily available in many applications Two approaches

Use a dictionary for rough translation

But refine it using the unaligned bilingual corpus

Use a dictionary to find alignments in the corpus

Then extract translation knowledge from the alignments


44/58

44

Feedback with Unaligned Corpora

Pseudo-relevance feedback is fully automatic

Augment the query with top ranked documents

Improves recall Recenters queries based on the corpus

Short queries get the most dramatic improvement

Two opportunities:

Query language: Improve the query

Document language: Suppress translation error


45/58

45

Context Linking

Automatically align portions of documents

For each query term:

Find translation pairs in corpus using dictionary Select a context of nearby terms

e.g., +/- 5 words in each language

Choose translations from most similar contexts

Based on cooccurrence with other translation pairs No reported experimental results


46/58

46

Problems with CLIR

Morphological processing difficult for somelanguages (e.g. Arabic) Many different encodings for Arabic

Windows Arabic (e.g. dictionaries)

Unicode (UTF-8) (e.g. corpus)

Macintosh Arabic (e.g. queries)

Normalization Remove diacritics

to Arabic (language)

Standardize spellings for foreign names vs Kleentoon vs Klntoon for Clinton


47/58

47

Problems with CLIR (contd)

Morphological processing (contd.) Arabic stemming

Root + patterns+suffixes+prefixes=wordktb+CiCaC=kitabAll verbs and nouns derived from fewer than 2000 roots

Roots too abstract for information retrievalktb kitaba bookkitabi my bookalkitabthe bookkitabuki your book (f)kataba to writekitabuka your book (m)maktabofficekitabuhu his book

maktaba library, bookstore ...Want stem=root+pattern+derivational affixes?No standard stemmers available,only morphological (root) analyzers


48/58

CLIR b h IR?


49/58

49

CLIR better than IR?

How can cross-language beat within-language? We knowthere are translation errors Surely those errors should hurtperformance

Hypothesis is that translation process may disambiguate

some query terms Words that are ambiguous in Arabic may not be ambiguous inEnglish

Expansion during translation from English to Arabic prevents theambiguity from re-appearing

Has been proposed that CLIR is a model for IR Translate query into one language and then back to original Given hypothesis, should have an improved query Should be reasonable to do this across many different languages

L D i L


50/58

50

Low Density Languages

Languages for which few on-line resources exist

Rumor has it that 25 languages are well represented on Web

Extreme is kitchen languages that are only spoken

More extreme: a language made up of whistling

Corpus to be searched may also be very small

Bilingual dictionaries often exist in print, may need to useinterlingua such as French

Some approaches, such as those relying on translationprobabilities may not work well

Solution depends on specific application


51/58

51

Performance Evaluation

C t ti T t C ll ti


52/58

52

Constructing Test Collections

One collection for retrospective retrieval

Start with a monolingual test collection

Documents, queries, relevance judgments Translate the queries by hand

Need 2 collections for adaptive filtering

Monolingual test collection in one language

Plus a document collection in the other language

Generate relevance judgments for the same queries

E l i C B d T h i


53/58

53

Evaluating Corpus-Based Techniques

Same domain evaluation

Partition a bilingual corpus

Design queries Generate relevance judgments for evaluation part

Cross-domain evaluation

Can use existing collections and corpora

No good metric for degree of domain shift

E l ti E l


54/58

54

Evaluation Example

Corpus-based samedomain evaluation

Use average precision as

figure of merit

Technique Cross-lang Mono-lingual Ratio

Cooccurrence-based dictionary 0.43 0.47 91%

Pseudo-relevance feedback 0.40 0.44 90%

Generalized vector space model 0.38 0.40 95%

Latent semantic indexing 0.31 0.37 84%

Dictionary-based translation 0.29 0.47 61%


55/58

55

User Interface Design

Q F l ti


56/58

56

Query Formulation

Interactive word sense disambiguation

Show users the translated query

Retranslate it for monolingual users

Provide an easy way of adjusting it

But dont require that users adjust or approve it

S l ti d E i ti


57/58

57

Selection and Examination

Document selection is a decision process

Relevance feedback, problem refinement, read it

Based on factors not used by the retrieval system Provide information to support that decision

May not require very good translations

e.g., Word-by-word title translation

People can read past some ambiguity

May help to display a few alternative translations

R f


58/58

58

References

Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). Powerpoint presentation, University of Buffalo. 2002

Douglas W Oard, Bonnie J Dorr. A Survey of Multilingual TextRetrieval .1996

Jian-Yun Nie: Cross-Language Information Retrieval. IEEE

Computational Intelligence Bulletin 2(1): 19-24 (2003) Hansen, Preben and Petrelli, Daniela and Karlgren, Jussi andBeaulieu, Micheline and Sanderson, Mark (2002) User-CenteredInterface Design for Cross-Language Information Retrieval. In:Proceedings of the Twenty-fifth Annual International ACM SIGIRConference on Research and Development in Information Retrieval,

Tampere, Finland. 2002 Elizabeth D. Liddy and Anne R. Diekema. Cross-Language

Information Exploitation of Arabic. Power point presentationApril 2005