2009.05.06 - SLIDE 1IS 240 – Spring 2009
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 26: CLIR
2009.05.06 - SLIDE 2IS 240 – Spring 2009
Today
• Review
  – NLP for IR
  – Text Summarization
• Cross-Language Information Retrieval
  – Introduction
  – Cross-Language EVIs
Thanks to Doug Oard (University of Maryland), Fredric Gey, and Aitao Chen for some of the material in this lecture
2009.05.06 - SLIDE 4IS 240 – Spring 2009
Natural Language Processing and IR
• The main approach in applying NLP to IR has been to attempt to address
– Phrase usage vs individual terms
– Search expansion using related terms/concepts
– Attempts to automatically exploit or assign controlled vocabularies
2009.05.06 - SLIDE 5IS 240 – Spring 2009
NLP and IR
• Much early research showed that (at least in the restricted test databases examined)
  – Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
  – Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements
2009.05.06 - SLIDE 6IS 240 – Spring 2009
Mark’s idle speculation
• What people think is going on always
[Figure labels: Keywords, NLP]
From Mark Sanderson, University of Sheffield
2009.05.06 - SLIDE 7IS 240 – Spring 2009
Mark’s idle speculation
• What’s usually actually going on
[Figure labels: Keywords, NLP]
From Mark Sanderson, University of Sheffield
2009.05.06 - SLIDE 8IS 240 – Spring 2009
Pasca and Harabagiu show value of sophisticated NLP
• Good IR is needed: paragraph retrieval based on SMART
• Large taxonomy of question types and expected answer types is crucial
• Statistical parser (modeled on Collins 1997) used to parse questions and relevant text for answers, and to build knowledge base
• Controlled query expansion loops (morphological, lexical synonyms, and semantic relations) are all important
• Answer ranking by simple ML method
From Christopher Manning, Stanford Univ.
2009.05.06 - SLIDE 9IS 240 – Spring 2009
Question Answering Example
• How hot does the inside of an active volcano get?
• get(TEMPERATURE, inside(volcano(active)))
• “lava fragments belched out of the mountain were as hot as 300 degrees Fahrenheit”
• fragments(lava, TEMPERATURE(degrees(300)), belched(out, mountain))
  – volcano ISA mountain
  – lava ISPARTOF volcano; lava inside volcano
  – fragments of lava HAVEPROPERTIESOF lava
• The needed semantic information is in WordNet definitions, and was successfully translated into a form that can be used for rough ‘proofs’
From Christopher Manning, Stanford Univ.
2009.05.06 - SLIDE 10IS 240 – Spring 2009
Conclusion
• Complete human-level natural language understanding is still a distant goal
• But there are now practical and usable partial NLU systems applicable to many problems
• An important design decision is in finding an appropriate match between (parts of) the application domain and the available methods
• But, used with care, statistical NLP methods have opened up new possibilities for high performance text understanding systems.
from Christopher Manning, Stanford Univ.
2009.05.06 - SLIDE 11IS 240 – Spring 2009
Today
• Review
  – NLP for IR
  – Text Summarization
• Cross-Language Information Retrieval
  – Introduction
  – Cross-Language EVIs
Thanks to Doug Oard (University of Maryland), Fredric Gey, and Aitao Chen for some of the material in this lecture
2009.05.06 - SLIDE 13IS 240 – Spring 2009
Cross-Language IR
• Given a query expressed in one language
• Find info that may be expressed in another
  – Electronic texts
  – Document images
  – Recorded speech [101]
  – Sign language
[Diagram labels: English Query, Retrieval System, French Documents]
2009.05.06 - SLIDE 14IS 240 – Spring 2009
Why Do Cross-Language IR?
• When users can read several languages
  – Eliminates multiple queries
  – Query in most fluent language
• Monolingual users can also benefit
  – If translations can be provided
  – If it suffices to know that a document exists
  – If text captions are used to search for images
2009.05.06 - SLIDE 15IS 240 – Spring 2009
What We Know
• Dictionaries are very useful
  – Easily get to 50% of monolingual IR effectiveness
  – We can get to about 75% using:
    • Part-of-speech tags
    • Pseudo-relevance feedback
    • Phrase indexing
• Multilingual training corpora are also useful
  – When the corpus is from the right domain
2009.05.06 - SLIDE 16IS 240 – Spring 2009
Related Issues
• Multiscript text processing [12]
  – Character sets, writing system, direction, ...
• Language identification [109]
  – Markup, detection
• Language-specific processing [103]
  – Stemming, morphological roots, compounds, …
• Document translation [51]
2009.05.06 - SLIDE 17IS 240 – Spring 2009
Cross-Language Text Retrieval
[Diagram: taxonomy of approaches. Node labels, grouped as they appear on the slide:]
• Query Translation / Document Translation
• Text Translation / Vector Translation
• Controlled Vocabulary / Free Text
• Knowledge-based / Corpus-based
• Ontology-based / Dictionary-based / Thesaurus-based
• Parallel / Comparable
• Term-aligned / Sentence-aligned / Document-aligned / Unaligned
2009.05.06 - SLIDE 18IS 240 – Spring 2009
Free Text Developments
• 1970, 1973 Salton
  – Hand coded bilingual dictionaries
• 1990 Latent Semantic Indexing [53]
  – French/English using Hansard training corpus
• 1994 European multilingual IR project [84]
  – Medium-scale recall/precision evaluation
• 1996 SIGIR Cross-lingual IR workshop
  – And many annual conferences and workshops since!
2009.05.06 - SLIDE 19IS 240 – Spring 2009
How Controlled Vocabulary Works
• Thesaurus design [102]
  – Design a knowledge structure for domain
  – Assign a unique “descriptor” to each concept
    • Include “scope notes” and “lead-in vocabulary”
• Document indexing
  – Read the document, assign appropriate descriptors
• Retrieval
  – Select desired descriptors, use exact match retrieval
2009.05.06 - SLIDE 20IS 240 – Spring 2009
Multilingual Thesauri
• Adapt the knowledge structure
  – Cultural differences influence indexing choices
• Use language-independent descriptors
  – Matched to a unique term in each language
• Three construction techniques [46]
  – Build it from scratch
  – Translate an existing thesaurus
  – Merge monolingual thesauri
2009.05.06 - SLIDE 21IS 240 – Spring 2009
Advantages over Free Text
• High-quality concept-based indexing
  – Descriptors need not appear in the document
• Knowledge-guided searching
  – Good thesauri capture expert domain knowledge
• Excellent cross-language effectiveness
  – Up to 100% of monolingual effectiveness
• Understandable retrieval results
• Efficient implementation
2009.05.06 - SLIDE 22IS 240 – Spring 2009
Limitations
• Costly to create
  – Design knowledge structure, index each document
• Costly to maintain
  – Document indexing, vocabulary and concept change
• Hard to use
  – Vocabulary choice, knowledge structure navigation
• Limited scope
  – Domain must be chosen at design time
2009.05.06 - SLIDE 23IS 240 – Spring 2009
Query vs. Document Translation
• Query translation
  – Very efficient for short queries
    • Not as big an advantage for relevance feedback
  – Hard to resolve ambiguous query terms
• Document translation
  – May be needed by the selection interface
    • And supports adaptive filtering well
  – Slow, but only need to do it once per document
    • Poor scale-up to large numbers of languages
2009.05.06 - SLIDE 24IS 240 – Spring 2009
Document Translation Example
• Approach
  – Select a single query language
  – Translate every document into that language
  – Perform monolingual retrieval
• Long documents provide enough context
  – And many translation errors do not hurt retrieval
• Much of the generation effort is wasted
  – And choosing a single translation can hurt
2009.05.06 - SLIDE 25IS 240 – Spring 2009
Query Translation Example
• Select controlled vocabulary search terms
• Retrieve documents in desired language
• Form monolingual query from the documents
• Perform a monolingual free text search
[Diagram labels: Information Need; Thesaurus; Controlled Vocabulary Multilingual Text Retrieval System; French Query Terms; English Abstracts; Alta Vista; English Web Pages]
2009.05.06 - SLIDE 26IS 240 – Spring 2009
Machine Readable Dictionaries
• Based on printed bilingual dictionaries
  – Becoming widely available
• Used to produce bilingual term lists
  – Cross-language term mappings are accessible
    • Sometimes listed in order of most common usage
  – Some knowledge structure is also present
    • Hard to extract and represent automatically
• The challenge is to pick the right translation
2009.05.06 - SLIDE 27IS 240 – Spring 2009
Unconstrained Query Translation
• Replace each word with every translation
  – Typically 5-10 translations per word
• About 50% of monolingual effectiveness
  – Main problem is ambiguity
  – Example: Fly (English)
    • 8 word senses (e.g., to fly a flag)
    • 13 Spanish translations (enarbolar, ondear, …)
    • 38 English retranslations (hoist, brandish, lift…)
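The "replace each word with every translation" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TERM_LIST` dictionary is a hypothetical toy term list (not a real resource), and passing unknown words through unchanged is a design choice made here, not something the slide prescribes.

```python
from collections import OrderedDict

# Hypothetical toy bilingual term list (English -> Spanish). The entries are
# illustrative only and are not taken from any real dictionary resource.
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear"],
    "flag": ["bandera"],
}


def translate_query(query, term_list):
    """Replace each query word with every translation listed for it.

    Words without an entry are passed through unchanged (an assumption;
    a real system might drop or transliterate them instead).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    # Drop duplicates while keeping first-seen order.
    return list(OrderedDict.fromkeys(translated))


print(translate_query("fly flag", TERM_LIST))
```

Because every sense of every word is kept, the translated query grows quickly and mixes unrelated senses, which is the ambiguity problem the slide points to.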
2009.05.06 - SLIDE 28IS 240 – Spring 2009
Phrase Indexing
• Improves retrieval effectiveness two ways
  – Phrases are less ambiguous than single words
  – Idiomatic phrases translate as a single concept
• Three ways to identify phrases
  – Semantic (e.g., appears in a dictionary)
  – Syntactic (e.g., parse as a noun phrase)
  – Cooccurrence (words found together often)
• Semantic phrase results are impressive
2009.05.06 - SLIDE 29IS 240 – Spring 2009
Types of Bilingual Corpora
• Parallel corpora: translation-equivalent pairs
  – Document pairs
  – Sentence pairs
  – Term pairs
• Comparable corpora
  – Content-equivalent document pairs
• Unaligned corpora
  – Content from the same domain
2009.05.06 - SLIDE 30IS 240 – Spring 2009
Generating Parallel Corpora
• Parallel corpora are naturally domain-tuned
  – Finding one for the right domain may be hard
• Alternative is to build one
  – Start with a monolingual corpus
  – Automatic machine translation for second language
• Worthwhile when IR technique is faster than MT
  – If translation errors don’t hurt the IR technique
• Good results with Latent Semantic Indexing
2009.05.06 - SLIDE 31IS 240 – Spring 2009
Pseudo-Relevance Feedback
[Diagram labels: French Query Terms; French Text Retrieval System; Parallel Corpus; Top ranked French Documents; English Translations; Alta Vista; English Web Pages]
• Enter query terms in French
• Find top French documents in parallel corpus
• Construct a query from English translations
• Perform a monolingual free text search
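The four steps above can be sketched as follows. This is a minimal illustration, not the system shown on the slide: the `PARALLEL` corpus is a hypothetical toy, ranking is plain term overlap rather than a real retrieval model, and no stopword removal is done.

```python
from collections import Counter

# Hypothetical toy parallel corpus: (French text, English translation) pairs.
PARALLEL = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chien noir court", "the black dog runs"),
    ("la bourse monte", "the stock market rises"),
]


def cross_language_prf(french_query, parallel, top_docs=2, top_terms=3):
    """Build an English query from the translations of top French documents."""
    query_terms = set(french_query.lower().split())
    # Steps 1-2: score French documents by term overlap with the French query.
    scored = []
    for french, english in parallel:
        overlap = len(query_terms & set(french.lower().split()))
        if overlap > 0:
            scored.append((overlap, english))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Step 3: pool the English sides of the top-ranked documents.
    counts = Counter()
    for _, english in scored[:top_docs]:
        counts.update(english.lower().split())
    # The most frequent English terms become the monolingual query (step 4
    # would submit them to an English free-text search engine).
    return [term for term, _ in counts.most_common(top_terms)]


print(cross_language_prf("chat noir", PARALLEL))
```

Note that no dictionary is consulted at any point: the parallel corpus itself carries the query across the language boundary.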
2009.05.06 - SLIDE 32IS 240 – Spring 2009
Similarity-Based Dictionaries
• Automatically developed from aligned documents
  – Reflects language use in a specific domain
• For each term, find most similar in other language
  – Retain only the top few (5 or so)
• Performs as well as dictionary-based techniques
  – Evaluated on a comparable corpus of news stories [98]
    • Stories were automatically linked based on date and subject
2009.05.06 - SLIDE 33IS 240 – Spring 2009
Latent Semantic Indexing
• Designed for better monolingual effectiveness
  – Works well across languages too [27]
    • Cross-language is just a type of term choice variation
• Produces short dense document vectors
  – Better than long sparse ones for adaptive filtering
    • Training data needs grow with dimensionality
  – Not as good for retrieval efficiency
    • Always 300 multiplications, even for short queries
2009.05.06 - SLIDE 34IS 240 – Spring 2009
Cooccurrence-Based Dictionaries
• Align terms using cooccurrence statistics
  – How often does a term pair occur in sentence pairs?
    • Weighted by relative position in the sentences
  – Retain term pairs that occur unusually often
• Useful for query translation
  – Excellent results when the domain is the same
• Also practical for document translation
  – Term use variations to reinforce good translations
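A rough sketch of term alignment from sentence pairs follows. It makes two simplifying assumptions that are not in the slide: "unusually often" is approximated with the Dice coefficient and a fixed threshold, and the weighting by relative position is omitted. The `SENTENCE_PAIRS` data is a hypothetical toy.

```python
from collections import Counter
from itertools import product

# Hypothetical toy sentence-aligned corpus (English, French).
SENTENCE_PAIRS = [
    ("the black cat", "le chat noir"),
    ("the black dog", "le chien noir"),
    ("the white cat", "le chat blanc"),
]


def cooccurrence_dictionary(pairs, min_score=0.99):
    """Keep term pairs whose Dice coefficient reaches min_score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_terms, tgt_terms = set(src.split()), set(tgt.split())
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        # Every source term cooccurs with every target term of the pair.
        pair_count.update(product(src_terms, tgt_terms))
    dictionary = {}
    for (s, t), n in pair_count.items():
        dice = 2 * n / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            dictionary.setdefault(s, []).append(t)
    return dictionary


print(cooccurrence_dictionary(SENTENCE_PAIRS))
```

On this toy data only pairs that always appear together survive the threshold; with real corpora the threshold, and a significance test in place of Dice, would need tuning.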
2009.05.06 - SLIDE 35IS 240 – Spring 2009
Language Identification
• Can be specified using metadata
  – Included in HTTP and HTML
• Determined using word-scale features
  – Which dictionary gets the most hits?
• Determined using subword features
  – Letter n-grams in electronic and printed text
  – Phoneme n-grams in speech
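The letter n-gram idea can be illustrated with a small sketch. The training sentences in `TRAINING` are hypothetical and far too short for real use, and scoring by shared trigram counts is just one simple choice among many.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Count letter n-grams, padding with spaces to capture word boundaries."""
    padded = f" {' '.join(text.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def identify_language(text, profiles, n=3):
    """Return the language whose profile shares the most n-grams with text."""
    sample = ngram_profile(text, n)

    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in sample.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Hypothetical, tiny training samples; real profiles need far more text.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify_language("the dog and the fox", PROFILES))
```

The same scheme would apply to phoneme n-grams in speech once a phoneme sequence is available; only the tokenization changes.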
2009.05.06 - SLIDE 36IS 240 – Spring 2009
Research Directions
• User needs assessment
• Evaluation
• Corpus construction
• Word sense disambiguation
• System integration
• Probabilistic models
• Adaptive filtering
2009.05.06 - SLIDE 37IS 240 – Spring 2009
Evaluation
• Most critical need is for side by side tests
  – TREC did this for French/German/Italian
• Domain shift metric
  – Domain shift hurts corpus-based techniques
  – Need a way to measure severity of the shift
• Test collections for adaptive filtering
  – From cross-language recall/precision evaluation
2009.05.06 - SLIDE 38IS 240 – Spring 2009
Corpus Construction
• Corpus-based techniques have great potential
• Parallel corpora are rare and expensive
  – Find it, reverse engineer the links, clean it up
• Unlinked corpora are of limited value
  – Context linking research could change that [77]
• Comparable corpora offer middle ground
  – Need to develop automatic linking techniques
  – Also need a metric for degree of comparability
2009.05.06 - SLIDE 39IS 240 – Spring 2009
Find and Interpret Information Vital to National Security
The Tamil National leader, Mr. V. Pirapaharan, delivered a speech on 13 May 1998, the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands, describing how the LTTE defended against Sri Lanka's latest military ambitions. Here’s what he said:
62 million people in South India and Sri Lanka can read this
• Find and retrieve information in unfamiliar languages
• Translate it into English
• Extract and correlate its content against other materials
TIDES
2009.05.06 - SLIDE 40IS 240 – Spring 2009
The Challenges
Today is a significant day in the history of our national liberation struggle, it marks the end of a year during which we have resisted and fought against the biggest ever offensive operation launched by the Sri Lankan armed forces code named "Jayasikuru”...
Translation
Topic Detection
Summarization
Extraction
The objective of the Sinhala chauvinists was to utilize maximum man power and fire power to destroy the military capability of the LTTE and to bring an end to the Tamil freedom movement. Before the launching of the operation "Jayasikuru" the Sri Lankan political and military high command miscalculated the military strength and determination of the LTTE.
•Liberation Tigers of Tamil Eelam (LTTE)
•Sri Lanka
•Velupillai Pirapaharan
•Rebellion
Org      Leader       HQ     Losses
Sinhala  Kumaratunga         3000
LTTE     Pirapaharan  Wanni  1300
(manual)
(experimental)
(special-purpose)
(key sentences)
Tamil document
Tamil document analysis
2009.05.06 - SLIDE 41IS 240 – Spring 2009
Cross-Language IR on the Web
• http://www.clis.umd.edu/dlrg/clir/
  – Most workshop proceedings
  – Lots of papers and project descriptions
  – Links to working systems
    • Including 2 web search engines
  – Useful linguistic resources
  – BibTeX for the attached bibliography
2009.05.06 - SLIDE 42IS 240 – Spring 2009
Today
• Review
  – NLP for IR
  – Text Summarization
• Cross-Language Information Retrieval
  – Introduction
  – Cross-Language EVIs
Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen
2009.05.06 - SLIDE 43IS 240 – Spring 2009
The “Entry Vocabulary Index” Approach to Multilingual Search
Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland
University of California, Berkeley
School of Information Management and Systems and UC Data
Harvesting Translingual Vocabulary Mappings for Multilingual Digital Libraries
Note: This talk was presented at the 2002 JCDL (Joint Conference on Digital Libraries)
2009.05.06 - SLIDE 44IS 240 – Spring 2009
Overview
• What are Entry Vocabulary Indexes?
  – EVI Research at Berkeley
  – Notion of an EVI
  – How are EVIs Built
• Berkeley Multilingual EVI
  – Technology components
  – Database
  – Examples of operation
• Ongoing research
2009.05.06 - SLIDE 45IS 240 – Spring 2009
Entry Vocabulary Index Research Projects at Berkeley
• DARPA Information Management Program
  – “Search Support for Unfamiliar Metadata Vocabularies”
• Institute for Museum and Library Services
  – “Seamless Searching of Numeric and Textual Resources”
• DARPA TIDES program
  – “Translingual Information Management Using Domain Ontologies”
• NSF/NASA/DARPA: DLI-2 (IDL)
  – “Discovery and Use of Textual, Numeric and Spatial Data”
2009.05.06 - SLIDE 46IS 240 – Spring 2009
The IMLS project: To demonstrate improved access to written material and numerical data on the same topic when searching two very different kinds of database:
--- books, articles, and their bibliographic records;
--- numerical data in socio-economic databases.
PHASE I: A library gateway providing search support for both text and socio-economic numeric databases. The gateway would accept a query in the library users’ own terms and would suggest which terms to use from the specialized categorization of the resource to be searched.
PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you to find documents on the same topic in a text database, and vice versa.
2009.05.06 - SLIDE 47IS 240 – Spring 2009
TIDES Project
• Translingual Information Detection, Extraction and Summarization
  – Building EVIs to map across languages
• Using same notion with training data in different languages
• Using Library of Congress Subject Headings from the CDL MELVYL database
2009.05.06 - SLIDE 48IS 240 – Spring 2009
What is an Entry Vocabulary Index?
• EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…
2009.05.06 - SLIDE 50IS 240 – Spring 2009
Classify and index with controlled vocabulary.
[Diagram label: Index]
Ideally, use a database already indexed
2009.05.06 - SLIDE 51IS 240 – Spring 2009
Problem: Controlled Vocabularies can be difficult for people to use.
“pass mtr veh spark ign eng”
[Diagram label: Index]
2009.05.06 - SLIDE 52IS 240 – Spring 2009
Solution: Entry Level Vocabulary Indexes.
[Diagram labels: Index, EVI]
“pass mtr veh spark ign eng” = “Automobile”
2009.05.06 - SLIDE 53IS 240 – Spring 2009
EVI example
User Query: “Automobile”
EVI 1: Index term: “pass mtr veh spark ign eng”
EVI 2: Index term: “automobiles” OR “internal combustible engines”
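Viewed as a data structure, each EVI behaves like a lookup table from user vocabulary to ranked index terms for one database. The sketch below assumes that shape; the weights in `EVIS` are made up for illustration, and the function name `suggest_index_terms` is hypothetical.

```python
# Hypothetical EVIs for two differently indexed databases. Each maps a user
# term to controlled-vocabulary entries with an association weight; the
# weights are invented for illustration.
EVIS = {
    "EVI 1": {"automobile": {"pass mtr veh spark ign eng": 3.0}},
    "EVI 2": {
        "automobile": {
            "automobiles": 5.0,
            "internal combustible engines": 2.0,
        }
    },
}


def suggest_index_terms(query, evi, top_k=2):
    """Rank controlled-vocabulary entries by summed weight over query words."""
    weights = {}
    for term in query.lower().split():
        for entry, weight in evi.get(term, {}).items():
            weights[entry] = weights.get(entry, 0.0) + weight
    return sorted(weights, key=weights.get, reverse=True)[:top_k]


for name, evi in EVIS.items():
    print(name, suggest_index_terms("Automobile", evi))
```

The same user query yields different suggestions per database, which is the point of the slide: the searcher keeps one vocabulary while each EVI absorbs the differences between indexing schemes.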
2009.05.06 - SLIDE 57IS 240 – Spring 2009
Find Plutonium
In Arabic Chinese Greek Japanese Korean Russian Tamil
Why not treat language the same way?
2009.05.06 - SLIDE 58IS 240 – Spring 2009
Find Plutonium
In Arabic Chinese Greek Japanese Korean Russian Tamil
Statistical association: W(c,t) = 2[logL(p1, a, a+b) ...]
Digital library resources
2009.05.06 - SLIDE 59IS 240 – Spring 2009
Background on Online Library Catalogs
• Library catalogs have been automated at a furious pace worldwide since the late ’70s
• Library objects (books, maps, pictures) in 400+ languages
• Bibliographic descriptions contain one or more sentences from a particular language (transliterated)
• Objects have been classified by subject by librarians
  – Library of Congress Subject Heading (Islamic Fundamentalism)
  – Library of Congress Classification (BP60, BP63, KF27)
  – Dewey Decimal Classification (297.2, 306.6, 320.5)
• International standard (MARC) for catalog metadata
• Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50
2009.05.06 - SLIDE 60IS 240 – Spring 2009
What can libraries and their catalogs provide?
• Millions of sentences in multiple languages
• Sentences with topical content identified from 150,000 Library of Congress Subject Headings
• Transfer point (interlingua) between English topics and words in other languages
• Can be used to create:
  – Bilingual dictionaries
  – Query expansion in cross-language information retrieval
2009.05.06 - SLIDE 61IS 240 – Spring 2009
Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”
Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”
2009.05.06 - SLIDE 62IS 240 – Spring 2009
Our Training Set and Prototype
• University of California/CDL MELVYL catalog
• Private copy, 10 million+ records (5 million non-English)
• Records in over 100 languages
• Obtained in MARC database standard format
• Foreign language titles use Library of Congress transliteration (Romanization) standard
• Prototype search software maps from/to English and
  – Arabic, Chinese, French, German
  – Italian, Japanese, Russian, Spanish
2009.05.06 - SLIDE 63IS 240 – Spring 2009
Technical Details
Building an Entry Vocabulary Module (EVI)
[Flow diagram; box labels:]
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases)
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
2009.05.06 - SLIDE 64IS 240 – Spring 2009
Association Measure
        C    ¬C
t       a    b
¬t      c    d
Where t is the occurrence of a term and C is the occurrence of a classification in the training set
2009.05.06 - SLIDE 65IS 240 – Spring 2009
Association Measure
• Maximum Likelihood ratio
$$W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]$$
where
$$\log L(p, k, n) = k \log p + (n - k) \log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
(cf. Dunning)
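The measure can be computed directly from the four cells of the contingency table. The sketch below reads the second argument of logL as the count and the third as the total, matching the calls logL(p1, a, a+b); the function names are hypothetical, and treating 0·log(0) as 0 and assuming non-empty rows are choices made here for the illustration.

```python
import math


def log_l(p, k, n):
    """k*log(p) + (n - k)*log(1 - p), with 0*log(0) treated as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def association_weight(a, b, c, d):
    """Likelihood-ratio weight W(C, t) from the 2x2 table cells.

    a: t and C, b: t and not C, c: not t and C, d: not t and not C.
    Assumes a + b > 0 and c + d > 0.
    """
    p1 = a / (a + b)
    p2 = c / (c + d)
    p = (a + c) / (a + b + c + d)
    return 2.0 * (
        log_l(p1, a, a + b)
        + log_l(p2, c, c + d)
        - log_l(p, a, a + b)
        - log_l(p, c, c + d)
    )


# A term that mostly occurs with the class scores high; an independent
# term scores (numerically) zero.
print(association_weight(50, 5, 5, 940))
print(association_weight(10, 10, 10, 10))
```

Ranking all classifications C by W(C, t) for a given term t gives the ordered suggestions an EVI returns.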
2009.05.06 - SLIDE 66IS 240 – Spring 2009
Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages
2009.05.06 - SLIDE 67IS 240 – Spring 2009
Non-English words can be mapped to English subject headings
2009.05.06 - SLIDE 69IS 240 – Spring 2009
Catalog Languages vs. FBIS Languages (University of California online catalog, 10 million records)
Approx. language distribution (Berkeley: # sentences; FBIS: est. # lines of source)

Language         Berkeley   FBIS
German           840,032    49,872
Spanish          614,025    388,772
French           609,089    2,871
Russian          341,050    15,415
Italian          266,424    254
Portuguese       149,389    24,930
Chinese          127,636    246,549
Japanese         110,956
Arabic           96,124     (8263)*
Dutch            90,170
Latin            88,818
Polish           81,698
Indonesian       59,445
Swedish          53,854     16,652
Hungarian        46,330     6,631
Hindi            42,886
Danish           41,517     18,688
Hebrew           41,468     3,500
Czech            35,432     3,647
Urdu             30,206
Turkish          30,015
Bulgarian        27,850
Norwegian        26,478     13,596
Korean           25,979     68,607
Rumanian         25,874
Finnish          25,027     8,187
Thai             24,693
Serbo-Croatian   24,601     36,139
Greek            23,926
Bengali          23,430
Catalan          20,392
Tamil            20,232

*English only, no source text
106 languages with > 500 records
2009.05.06 - SLIDE 70IS 240 – Spring 2009
Future Research
• Add content from other online library catalogs
  – RLIN (>30M records, >900K Chinese, >250K Arabic)
  – COPAC [UK] (9M records, 40k Arabic)
• Transliteration and back-transliteration for scripted languages
• Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)
• Further evaluation (TREC, CLEF, NTCIR and local analysis)