Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation. G. Craig Murray et al., COLING 2006. Reporter: Yong-Xiang Chen


Page 1

Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation

G. Craig Murray et al., COLING 2006
Reporter: Yong-Xiang Chen

Page 2

Background and Problem

Thesauri and ontologies provide important value in facilitating access to digital archives by
– representing underlying principles of organization

Translation of such resources into multiple languages is an important component
– Specificity of vocabulary terms in most ontologies precludes fully-automated machine translation using general-domain lexical resources

Page 3

Research Approach

Present an efficient process for leveraging human translations when constructing domain-specific lexical resources

Evaluate the effectiveness of this process by producing a probabilistic phrase dictionary and translating a thesaurus of 56,000 concepts used to catalogue a large archive of oral histories

Page 4

Purpose

If we need humans to assist in the translation process, how can we maximize access while minimizing cost?

Reuse!!
Useful First!!

Page 5

Specific Problem

Most digital collections of any significant size use a system of organization that facilitates easy access to collection contents
– The organizing principles are captured in the form of a controlled vocabulary of keyword phrases
– Usually arranged in a hierarchic thesaurus or ontology

Page 6

Idea

Collected 3,000 manual translations of keyword phrases

Reused the translated terms to generate a lexicon for automated translation of the rest of the thesaurus

Keyword phrases are prioritized in terms of
– their value in accessing the collection
– the reusability of their component terms

Page 7

Checked and Aligned the Translations

Translations collected from one human informant

Checked and aligned to the original English terms by a second informant

Induce a probabilistic English-Czech phrase dictionary

Page 8

Maximizing Value and Reusability

To quantify the utility of 3,000 manual translations of keyword phrases
– Define two values for each keyword phrase in the thesaurus

Thesaurus value
– representing the importance of the keyword phrase for providing access to the collection

Translation value
– representing the usefulness of having the keyword phrase translated

The second is related to the first

Page 9

Keyword hierarchy

Page 10

Meaning of nodes

Internal (non-leaf) nodes of the hierarchy are used to organize concepts and support concept browsing

Leaf nodes are very specific and are only used to index video content

Page 11

Example of Thesaurus value

The keyword phrase “Auschwitz II-Birkenau (Poland: Death Camp)”
– Describes a Nazi death camp
– Is assigned to 17,555 video segments in the collection
– Has broader (parent) terms and narrower (child) terms

“German death camps” is not assigned to any video segments
However, “German death camps” has very important narrower terms, including “Auschwitz II-Birkenau” and others

Page 12

Thesaurus value

Represents the importance of each keyword phrase to the thesaurus

An internal node is valuable in providing access to its children

But if a node were valued by the sum of the values of all its children, grandchildren, etc., the resulting calculation would be biased toward the top of the hierarchy

Page 13

Thesaurus value

Leaf node: the number of video segments to which the concept has been assigned

Parent node: the number of video segments assigned to the node itself (if any), plus the average of the thesaurus values of its child nodes

The final values quantify how valuable the translation of any given keyword phrase would be in providing access to video segments
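The leaf and parent rules on this slide can be sketched as a short recursion. This is only an illustration: the `Node` class, the `thesaurus_value` function, and the second camp with its count of 445 are assumptions made for the example. Only the 17,555 figure comes from the slides.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    """One keyword phrase in the hierarchy (hypothetical structure)."""
    phrase: str
    segments: int = 0  # number of video segments indexed with this phrase
    children: list["Node"] = field(default_factory=list)


def thesaurus_value(node: Node) -> float:
    """Own segment count plus the average thesaurus value of the children."""
    if not node.children:
        return float(node.segments)
    return node.segments + mean(thesaurus_value(c) for c in node.children)


# Toy hierarchy: the second child and its count are invented for illustration.
birkenau = Node("Auschwitz II-Birkenau (Poland: Death Camp)", segments=17555)
other_camp = Node("hypothetical other camp", segments=445)
camps = Node("German death camps", segments=0, children=[birkenau, other_camp])

print(thesaurus_value(camps))  # 9000.0
```

Averaging rather than summing keeps "German death camps" valuable without letting every ancestor accumulate the full weight of its subtree.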

Page 14

Translation value

Compute the translation value for each word in the vocabulary as the sum of the thesaurus values of every keyword phrase that contains that word
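A minimal sketch of this sum, assuming thesaurus values are already available as a mapping from phrase to value. The function name `translation_values`, the whitespace tokenization, and the toy numbers are assumptions for the example, not details from the slides.

```python
from collections import defaultdict


def translation_values(thesaurus_values: dict[str, float]) -> dict[str, float]:
    """Sum, for each word, the thesaurus value of every phrase containing it."""
    totals: dict[str, float] = defaultdict(float)
    for phrase, value in thesaurus_values.items():
        # set() counts a word once per phrase even if it repeats (assumption)
        for word in set(phrase.lower().split()):
            totals[word] += value
    return dict(totals)


# Invented thesaurus values for three toy phrases.
toy = {"death camps": 100.0, "labor camps": 40.0, "camp liberation": 25.0}
word_values = translation_values(toy)

print(word_values["camps"])  # 140.0
```

Here "camps" scores highest because it appears in two phrases, which is the reusability effect the slides aim to exploit.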

Page 15

Use these values

The end result is a list of vocabulary words and the impact that correct translation of each word would have on the overall value of the translated thesaurus

We elicited human translations of entire keyword phrases rather than individual vocabulary terms

Page 16

Prioritize

The value gained by translating any given phrase is more accurately estimated by the total value of any untranslated words it contains

Prioritized the order of keyword phrase translations based on the translation value of the untranslated words in each keyword phrase

This prioritization assumes that any words contained in a keyword phrase of higher priority would already have been translated
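One way to read this ordering is as a greedy loop: pick the phrase whose untranslated words carry the most translation value, mark its words as translated, and re-score the rest. The sketch below follows that reading; the greedy re-scoring, the function names, and the toy values are assumptions, and the slides do not spell out the exact procedure.

```python
def untranslated_value(phrase: str, word_value: dict[str, float],
                       translated: set[str]) -> float:
    """Total translation value of the words in `phrase` not yet translated."""
    return sum(word_value.get(w, 0.0) for w in set(phrase.split()) - translated)


def prioritize(phrases: list[str], word_value: dict[str, float],
               budget: int) -> list[str]:
    """Greedily order up to `budget` phrases for human translation."""
    translated: set[str] = set()
    remaining = list(phrases)
    order: list[str] = []
    while remaining and len(order) < budget:
        best = max(remaining,
                   key=lambda p: untranslated_value(p, word_value, translated))
        order.append(best)
        remaining.remove(best)
        translated.update(best.split())  # these words are now covered
    return order


# Invented translation values for illustration.
toy_word_value = {"death": 100.0, "camps": 140.0, "labor": 40.0,
                  "camp": 25.0, "liberation": 25.0}
toy_phrases = ["labor camps", "death camps", "camp liberation"]

print(prioritize(toy_phrases, toy_word_value, budget=3))
# ['death camps', 'camp liberation', 'labor camps']
```

Once "death camps" is chosen, "camps" no longer counts toward "labor camps", so "camp liberation" overtakes it even though its words are individually less valuable.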

Page 17

Alignment

Obtained professional translations for the top 3,000 English keyword phrases

These translations were:
– tokenized and presented to another bilingual Czech speaker (the second informant) for verification and alignment

The alignment process was then used to build a probabilistic dictionary of words and phrases
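The slides do not say how the probabilities are estimated. A common choice, assumed here, is relative frequency over the aligned pairs. The `build_dictionary` function and the aligned pairs below are invented for illustration.

```python
from collections import Counter, defaultdict


def build_dictionary(aligned_pairs):
    """Estimate P(czech | english) by relative frequency (an assumption)."""
    counts = defaultdict(Counter)
    for english, czech in aligned_pairs:
        counts[english][czech] += 1
    return {
        english: {cz: n / sum(c.values()) for cz, n in c.items()}
        for english, c in counts.items()
    }


# Illustrative alignments only; not taken from the actual data.
pairs = [
    ("camps", "tábory"),
    ("camps", "tábory"),
    ("camps", "tábory"),
    ("camps", "lágry"),
    ("monasteries and convents", "kláštery"),
]
dictionary = build_dictionary(pairs)

print(dictionary["camps"])  # {'tábory': 0.75, 'lágry': 0.25}
```

Keys may be single words or multi-word phrases, matching the slide's "dictionary of words and phrases".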

Page 18

Example of alignment

Page 19

Machine Translation

The system first scans the English input to find the longest matching substring in our dictionary, and replaces it with the most likely Czech translation

Looks up “monasteries and convents stills” in the dictionary
– finds no translation,
– backs off to “monasteries and convents”
– translated to “kláštery”
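The longest-match-with-back-off behaviour described here can be sketched as follows. This is a simplified illustration rather than the authors' system: the `translate` function, the left-to-right scan, the pass-through of unknown words, and the dictionary entry for "stills" are all assumptions.

```python
def translate(phrase: str, dictionary: dict[str, dict[str, float]]) -> str:
    """Greedy longest-match translation with back-off to shorter spans."""
    tokens = phrase.split()
    output: list[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i, then back off one word at a time.
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in dictionary:
                options = dictionary[candidate]
                output.append(max(options, key=options.get))  # most likely
                i = j
                break
        else:
            output.append(tokens[i])  # no entry: copy the word (assumption)
            i += 1
    return " ".join(output)


# Toy dictionary; the "stills" entry is invented for illustration.
toy_dictionary = {
    "monasteries and convents": {"kláštery": 1.0},
    "stills": {"fotografie": 1.0},
}

print(translate("monasteries and convents stills", toy_dictionary))
# kláštery fotografie
```

The full phrase has no entry, so the scan backs off to "monasteries and convents" and then continues with the remaining word.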

Page 20

Gain rate of access value

Page 21

Experiment

MALACH project
– an NSF-funded effort to improve multilingual information access to large archives of spoken language

Leverages a small set of manually acquired English-Czech translations to translate a large ontology of keyword phrases

Page 22

Evaluation

Compared our system output to human reference translations using Bleu (Papineni et al., 2002)

Showed corrected and uncorrected machine translations to Czech speakers and collected subjective judgments of fluency and accuracy
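For readers unfamiliar with Bleu, the sketch below shows the core idea: clipped n-gram precision combined with a brevity penalty. It is deliberately simplified and is not the configuration used in the evaluation. It scores one phrase against one reference, uses n-grams only up to `max_n=2`, and applies no smoothing; those choices and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified single-reference Bleu with n-grams up to max_n."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(bleu("a b c", "a b c"))  # 1.0
print(bleu("x y", "a b"))      # 0.0
```

Because keyword phrases are short, higher-order n-grams are often absent, which is why this sketch defaults to bigrams.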

Page 23

Bleu Scores

Page 24

Subjective Judgment Score

Selected 418 keyword phrases to be used as target translations

Page 25

Conclusion

Demonstrate that prioritization based on hierarchical position and frequency of use facilitates extremely efficient reuse of human input

Evaluations show that our technique boosts the performance of a simple translation system by 65%.