Download - Good Template
-
8/2/2019 Good Template
1/58
1
Cross Language Information
Retrieval (CLIR)
Modern Information Retrieval
Sharif University of Technology
Fall 2005
Mohsen Jamali
-
8/2/2019 Good Template
2/58
2
The General Problem
Find documents written in any language
Using queries expressed in a single language
-
8/2/2019 Good Template
3/58
3
The General Problem (cont)
Traditional IR identifies relevant documents in thesame language as the query (monolingual IR)
Cross-language information retrieval (CLIR) tries
to identify relevant documents in a languagedifferent from that of the query
This problem is more and more acute for IR on the
Web due to the fact that the Web is a trulymultilingual environment
-
8/2/2019 Good Template
4/58
4
Why is CLIR important?
-
8/2/2019 Good Template
5/58
5
Characteristics of the WWW
Country of Origin of Public Web Sites, 2001 (% of Total)(OCLC Web Characterization, 2001)
-
8/2/2019 Good Template
6/58
6Source: Global Reach
5%
9%
6%
5%
5%
4%
4%
3%
2%
2%
3%
52%
8%
8%
5%
3%
21%
2%
5%2%
5%
3%
6%
32%
Spanish Japanese German French
Chinese Scandanavian Italian Dutch
Korean Portuguese Other English
EnglishEnglish
2000 2005
Global Internet User Population
Chinese
-
8/2/2019 Good Template
7/58
7
Importance of CLIR
CLIR research is becoming more and moreimportant for global information exchange andknowledge sharing.
National Security
Foreign Patent Information Access
Medical Information Access for Patients
-
8/2/2019 Good Template
8/58
8
CLIR is Multidisciplinary
CLIR involves researchers from thefollowing fields:
information retrieval, natural languageprocessing, machine translation andsummarization, speech processing,document image understanding,
human-computer interaction
-
8/2/2019 Good Template
9/58
9
User Needs
Search a monolingual collection in a language thatthe user cannot read.
Retrieve information from a multilingual collection
using a query in a single language.
Select images from a collection indexed with freetext captions in an unfamiliar language.
Locate documents in a multilingual collection ofscanned page images.
-
8/2/2019 Good Template
10/58
10
Why Do Cross-Language IR?
When users can read several languages
Eliminates multiple queries
Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images
-
8/2/2019 Good Template
11/58
11
Language Identification
Can be specified using metadata
Included in HTTP and HTML
Determined using word-scale features
Which dictionary gets the most hits?
Determined using subword features
Letter n-grams in electronic and printed text
Phoneme n-grams in speech
-
8/2/2019 Good Template
12/58
12
Language Encoding Standards
Language (alphabet) specific native encoding:
Chinese GB, Big5,
Western European ISO-8859-1 (Latin1)
Russian KOI-8, ISO-8859-5, CP-1251
UNICODE (ISO/IEC 10646)
UTF-8 variable-byte length UTF-16, UCS-2 fixed double-byte
-
8/2/2019 Good Template
13/58
13
CLIR Experimental System
2 systems: SMART Information retrieval system modified to work with 11
European languages (Danish, Dutch, English, Finnish, French,German, Italian, Norwegian, Portuguese, Spanish, Swedish)
Generation of restricted bigrams
Pseudo-Relevance feedback TAPIR is a language model IR system written by M. Srikanth. It
has been adated to work with 12 different European languages(Danish, Dutch, English, Finnish, French, German, Italian,Norwegian, Portuguese, Russian, Spanish, Swedish)
Stemming using Porters stemmer Translation using Intertran(http://www.tranexp.com:2000/InterTran)
http://www.tranexp.com:2000/InterTranhttp://www.tranexp.com:2000/InterTran -
8/2/2019 Good Template
14/58
14
Approaches to CLIR
-
8/2/2019 Good Template
15/58
15
Design Decisions
What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary, ontology, training corpus
-
8/2/2019 Good Template
16/58
16
Term-aligned Sentence-aligned Document-aligned Unaligned
Parallel Comparable
Knowledge-based Corpus-based
Controlled Vocabulary Free Text
Cross-Language Text Retrieval
Query Translation Document Translation
Text Translation Vector Translation
Ontology-based Dictionary-based
Thesaurus-based
-
8/2/2019 Good Template
17/58
-
8/2/2019 Good Template
18/58
18
Controlled Vocabulary Matures
1977 IBM STAIRS-TLS
Large-scale commercial cross-language IR
1978 ISO Standard 5964
Guidelines for developing multilingual thesauri
1984 EUROVOC thesaurus
Now includes all 9 EC languages
1985 ISO Standard 5964 revised
-
8/2/2019 Good Template
19/58
19
Free Text Developments
1970, 1973 Salton
Hand coded bilingual term lists
1990 Latent Semantic Indexing
1994 European multilingual IR project
First precision/recall evaluation
1996 SIGIR Cross-lingual IR workshop
1998 EU/NSF digital library working group
-
8/2/2019 Good Template
20/58
20
Query vs. Document Translation
Query translation
Very efficient for short queries
Not as big an advantage for relevance feedback
Hard to resolve ambiguous query terms
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Slow, but only need to do it once per document
Poor scale-up to large numbers of languages
-
8/2/2019 Good Template
21/58
21
Document Translation Example
Approach
Select a single query language
Translate every document into that language Perform monolingual retrieval
Long documents provide enough context
And many translation errors do not hurt retrieval
Much of the generation effort is wasted
And choosing a single translation can hurt
-
8/2/2019 Good Template
22/58
22
Text Translation
One weakness of present fully automatic machinetranslation systems is that they are able to produce highquality translations only in limited domains
Text retrieval systems are typically more tolerant ofsyntactic than semantic translation errors but thatsemantic accuracy suffers when insufficient domainknowledge is encoded into a translation system
In fact some of the work done by a machine translationsystem could actually reduce some measures of retrieval
effectiveness
-
8/2/2019 Good Template
23/58
23
Query Translation Example
Select controlled vocabulary search terms
Retrieve documents in desired language
Form monolingual query from the documents
Perform a monolingual free text search
Information
NeedThesaurus
Controlled
Vocabulary
Multilingual
Text Retrieval
System
Alta Vista
FrenchQuery
Terms
English
Abstracts
EnglishWeb
Pages
-
8/2/2019 Good Template
24/58
24
Query Translation
Queries (E) Query
Translation
ChineseIR System
MTSystem
An English-Chinese CLIR System
Queries (C)
Results (C)Results (E)
ChineseDocuments
-
8/2/2019 Good Template
25/58
25
Controlled Vocabulary
A controlled vocabulary information retrieval system canbe very useful in the hands of a skilled searcher, but endusers often find free text searching to be more helpful.
Experience has shown that although the domain
knowledge that can be encoded in a thesaurus permitsexperienced users to form more precise queries casualand intermittent users have diffculty exploiting theexpressive power of a traditional query interface in exactmatch retrieval systems
Controlled vocabulary text retrieval systems are widelyused in libraries and user needs assessment has receivedconsiderable attention from library and information scienceresearchers.
-
8/2/2019 Good Template
26/58
26
Knowledge-based Techniquesfor Free Text Searching
-
8/2/2019 Good Template
27/58
27
Knowledge Structures for IR
Ontology Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary Ontology specialized for human translation
-
8/2/2019 Good Template
28/58
28
Machine Readable Dictionaries
Based on printed bilingual dictionaries
Becoming widely available
Used to produce bilingual term lists Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present
Hard to extract and represent automatically
The challenge is to pick the right translation
-
8/2/2019 Good Template
29/58
29
CLIR: Dictionary Based
Problems
Limitations of dictionaries
Inflected word forms
Phrases and compound words
Lexical ambiguity
Possible solution
Approximate string matching
Unconstrained Query
-
8/2/2019 Good Template
30/58
30
Unconstrained QueryTranslation
Replace each word with every translation
Typically 5-10 translations per word
About 50% of monolingual effectiveness Ambiguity is a serious problem
Example: Fly (English)
8 word senses (e.g., to fly a flag)
13 Spanish translations (enarbolar, ondear, )
38 English retranslations (hoist, brandish, lift)
-
8/2/2019 Good Template
31/58
31
Linguistic
Analysisand
LanguageSupport
MachineTranslation
MultilingualSearch Engine
User
Interface
French
Query
Spanish
QueryChineseQuery
EnglishQuery
SpanishDatabase
French
Database
Chinese
Database
English
Database
Merging
Results
EnglishUser
Query
Expanded
Query
List ofResults
Requests
for DocumentTranslation
-
8/2/2019 Good Template
32/58
32
Exploiting Part-of-Speech Tags
Constrain translations by part of speech
Noun, verb, adjective,
Effective taggers are available
Works well when queries are full sentences
Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR
Nouns in queries often match verbs in documents
-
8/2/2019 Good Template
33/58
33
Phrase Indexing
Improves retrieval effectiveness two ways
Phrases are less ambiguous than single words
Idiomatic phrases translate as a single concept Three ways to identify phrases
Semantic (e.g., appears in a dictionary)
Syntactic (e.g., parse as a noun phrase)
Cooccurrence (words found together often)
Semantic phrase results are impressive
-
8/2/2019 Good Template
34/58
34
Corpus-based Techniquesfor Free Text Searching
-
8/2/2019 Good Template
35/58
35
Types of Bilingual Corpora
Parallel corpora: translation-equivalent pairs
Document pairs
Sentence pairs
Term pairs
Comparable corpora
Content-equivalent document pairs
Unaligned corpora Content from the same domain
-
8/2/2019 Good Template
36/58
36
Pseudo-Relevance Feedback
Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search
Top ranked
FrenchDocumentsFrench
Text
Retrieval
System
Alta Vista
French
QueryTerms
EnglishTranslations
English
WebPages
Parallel
Corpus
-
8/2/2019 Good Template
37/58
37
Learning From Document Pairs
Count how often each term occurs in each pair
Treat each pair as a single document
E1 E2 E3 E4 E5 S1 S2 S3 S4
Doc 1
Doc 2
Doc 3
Doc 4
Doc 5
4 2 2 1
8 4 4 2
2 2 2 1
2 1 2 1
4 1 2 1
English Terms Spanish Terms
-
8/2/2019 Good Template
38/58
38
Similarity-Based Dictionaries
Automatically developed from aligned documents
Terms E1 and E3 are used in similar ways
Terms E1 & S1 (or E3 & S4) are even more similar For each term, find most similar in other language
Retain only the top few (5 or so)
Performs as well as dictionary-based techniques
Evaluated on a comparable corpus of news stories
Stories were automatically linked based on date and subject
-
8/2/2019 Good Template
39/58
39
Generalized Vector Space Model
Term space of each language is different
But the document space for a corpus is the same
Describe new documents based on the corpus Vector of cosine similarity to each corpus document
Easily generated from a vector of term weights
Multiply by the term-document matrix
Compute cosine similarity in document space
Excellent results when the domain is the same
-
8/2/2019 Good Template
40/58
40
Latent Semantic Indexing
Designed for better monolingual effectiveness
Works well across languages too
Cross-language is just a type of term choice variation Produces short dense document vectors
Better than long sparse ones for adaptive filtering
Training data needs grow with dimensionality
Not as good for retrieval efficiency
Always 300 multiplications, even for short queries
-
8/2/2019 Good Template
41/58
41
Sentence-Aligned Parallel Corpora
Easily constructed from aligned documents
Match pattern of relative sentence lengths
Not yet used directly for effective retrieval But all experiments have included domain shift
Good first step for term alignment
Sentences define a natural context
-
8/2/2019 Good Template
42/58
42
Cooccurrence-Based Translation
Align terms using cooccurrence statistics
How often do a term pair occur in sentence pairs?
Weighted by relative position in the sentences Retain term pairs that occur unusually often
Useful for query translation
Excellent results when the domain is the same
Also practical for document translation
Term usage reinforces good translations
-
8/2/2019 Good Template
43/58
43
Exploiting Unaligned Corpora
Documents about the same set of subjects
No known relationship between document pairs
Easily available in many applications Two approaches
Use a dictionary for rough translation
But refine it using the unaligned bilingual corpus
Use a dictionary to find alignments in the corpus
Then extract translation knowledge from the alignments
-
8/2/2019 Good Template
44/58
44
Feedback with Unaligned Corpora
Pseudo-relevance feedback is fully automatic
Augment the query with top ranked documents
Improves recall Recenters queries based on the corpus
Short queries get the most dramatic improvement
Two opportunities:
Query language: Improve the query
Document language: Suppress translation error
-
8/2/2019 Good Template
45/58
45
Context Linking
Automatically align portions of documents
For each query term:
Find translation pairs in corpus using dictionary Select a context of nearby terms
e.g., +/- 5 words in each language
Choose translations from most similar contexts
Based on cooccurrence with other translation pairs No reported experimental results
-
8/2/2019 Good Template
46/58
46
Problems with CLIR
Morphological processing difficult for somelanguages (e.g. Arabic) Many different encodings for Arabic
Windows Arabic (e.g. dictionaries)
Unicode (UTF-8) (e.g. corpus)
Macintosh Arabic (e.g. queries)
Normalization Remove diacritics
to Arabic (language)
Standardize spellings for foreign names vs Kleentoon vs Klntoon for Clinton
-
8/2/2019 Good Template
47/58
47
Problems with CLIR (contd)
Morphological processing (contd.) Arabic stemming
Root + patterns+suffixes+prefixes=wordktb+CiCaC=kitabAll verbs and nouns derived from fewer than 2000 roots
Roots too abstract for information retrievalktb kitaba bookkitabi my bookalkitabthe bookkitabuki your book (f)kataba to writekitabuka your book (m)maktabofficekitabuhu his book
maktaba library, bookstore ...Want stem=root+pattern+derivational affixes?No standard stemmers available,only morphological (root) analyzers
-
8/2/2019 Good Template
48/58
CLIR b h IR?
-
8/2/2019 Good Template
49/58
49
CLIR better than IR?
How can cross-language beat within-language? We knowthere are translation errors Surely those errors should hurtperformance
Hypothesis is that translation process may disambiguate
some query terms Words that are ambiguous in Arabic may not be ambiguous inEnglish
Expansion during translation from English to Arabic prevents theambiguity from re-appearing
Has been proposed that CLIR is a model for IR Translate query into one language and then back to original Given hypothesis, should have an improved query Should be reasonable to do this across many different languages
L D i L
-
8/2/2019 Good Template
50/58
50
Low Density Languages
Languages for which few on-line resources exist
Rumor has it that 25 languages are well represented on Web
Extreme is kitchen languages that are only spoken
More extreme: a language made up of whistling
Corpus to be searched may also be very small
Bilingual dictionaries often exist in print, may need to useinterlingua such as French
Some approaches, such as those relying on translationprobabilities may not work well
Solution depends on specific application
-
8/2/2019 Good Template
51/58
51
Performance Evaluation
C t ti T t C ll ti
-
8/2/2019 Good Template
52/58
52
Constructing Test Collections
One collection for retrospective retrieval
Start with a monolingual test collection
Documents, queries, relevance judgments Translate the queries by hand
Need 2 collections for adaptive filtering
Monolingual test collection in one language
Plus a document collection in the other language
Generate relevance judgments for the same queries
E l i C B d T h i
-
8/2/2019 Good Template
53/58
53
Evaluating Corpus-Based Techniques
Same domain evaluation
Partition a bilingual corpus
Design queries Generate relevance judgments for evaluation part
Cross-domain evaluation
Can use existing collections and corpora
No good metric for degree of domain shift
E l ti E l
-
8/2/2019 Good Template
54/58
54
Evaluation Example
Corpus-based samedomain evaluation
Use average precision as
figure of merit
Technique Cross-lang Mono-lingual Ratio
Cooccurrence-based dictionary 0.43 0.47 91%
Pseudo-relevance feedback 0.40 0.44 90%
Generalized vector space model 0.38 0.40 95%
Latent semantic indexing 0.31 0.37 84%
Dictionary-based translation 0.29 0.47 61%
-
8/2/2019 Good Template
55/58
55
User Interface Design
Q F l ti
-
8/2/2019 Good Template
56/58
56
Query Formulation
Interactive word sense disambiguation
Show users the translated query
Retranslate it for monolingual users
Provide an easy way of adjusting it
But dont require that users adjust or approve it
S l ti d E i ti
-
8/2/2019 Good Template
57/58
57
Selection and Examination
Document selection is a decision process
Relevance feedback, problem refinement, read it
Based on factors not used by the retrieval system Provide information to support that decision
May not require very good translations
e.g., Word-by-word title translation
People can read past some ambiguity
May help to display a few alternative translations
R f
-
8/2/2019 Good Template
58/58
58
References
Miguel E. Ruiz. Cross Language Information Retrieval (CLIR). Powerpoint presentation, University of Buffalo. 2002
Douglas W Oard, Bonnie J Dorr. A Survey of Multilingual TextRetrieval .1996
Jian-Yun Nie: Cross-Language Information Retrieval. IEEE
Computational Intelligence Bulletin 2(1): 19-24 (2003) Hansen, Preben and Petrelli, Daniela and Karlgren, Jussi andBeaulieu, Micheline and Sanderson, Mark (2002) User-CenteredInterface Design for Cross-Language Information Retrieval. In:Proceedings of the Twenty-fifth Annual International ACM SIGIRConference on Research and Development in Information Retrieval,
Tampere, Finland. 2002 Elizabeth D. Liddy and Anne R. Diekema. Cross-Language
Information Exploitation of Arabic. Power point presentationApril 2005