Towards the Use of Linguistic Information in Automatic MT Evaluation Metrics

Thesis Project

Elisabet Comelles

Supervisors: Irene Castellon and Victoria Arranz

Outline

• Introduction

• State of the Art

• Discussion of MT Evaluation Metrics

• Hypothesis & Objective

• Methodology & Schedule

Introduction

• Quick access to multilingual information

• Need for quick translation

• Sharp increase in the number of MT systems

• Need for evaluation of those MT Systems

• Evaluation needs to be quick and reliable

Introduction

• The most widely used evaluation metrics today show problems

• New approaches to evaluation using linguistic information:

  – Syntactic info

  – Semantic info

• Our scenario:

  – Comparison between already existing systems

  – Direction of translation to test: English-Spanish

State of the Art

• MT is closely linked to MT evaluation

• Purpose of the evaluation methods:

  – Error analysis

  – System comparison

• Chronologically:

  1. Human MT Evaluation

  2. Automatic MT Evaluation

State of the Art: Types of MT Evaluation

• Focused on Context:

  – Context-based Evaluation (FEMTI)

    • Evaluates the suitability of the MT technology and the MT system for the user's purpose

    • Parameters of analysis: functionality, reliability, usability, efficiency, maintainability, portability, cost, etc.

• Focused on Quantity & Quality:

  – Human Evaluation and Automatic Evaluation


• Human Evaluation:

  – Several approaches:

    • Fidelity (ALPAC report)

    • Intelligibility (ALPAC report)

    • Comprehensive evaluation of informativeness (ARPA)

    • Quality panel evaluation

    • Adequacy and Fluency (Semantics and Syntax)

    • Preferred Translation

    • Required Post-Editing


• Human Evaluation:

  – Advantage: human evaluators can evaluate the overall quality of the system

  – Disadvantages:

    • Time-consuming

    • Expensive

    • Subjective


• Automatic Evaluation:

  – Approaches:

    • Based on Lexical Matching

    • Based on Syntax

    • Based on Semantics


• Based on Lexical Matching:

  – Dominant approach to Automatic MT Evaluation

  – Looks for lexical similarities between the MT output and the reference translations

  – Types:

    • Edit Distance Measures (WER)

    • Precision-oriented Measures (BLEU)

    • Recall-oriented Measures (ROUGE)

    • Measures balancing Precision & Recall (GTM)
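The four families listed above can be illustrated with a minimal sketch. It assumes whitespace-tokenized sentences and a single reference, and the function names are hypothetical. These are simplified stand-ins rather than the official metrics: BLEU additionally combines several n-gram orders with a brevity penalty, ROUGE has several variants, and GTM rewards longer contiguous matches instead of using a plain F-measure.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete hypothesis word h
                            curr[j - 1] + 1,      # insert reference word r
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(hyp, ref):
    """Edit-distance family: word error rate against one reference."""
    return edit_distance(hyp, ref) / len(ref)


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision_recall(hyp, ref, n=1):
    """Clipped n-gram overlap, as precision (BLEU-like) and recall (ROUGE-like)."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean, balancing precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy sentences invented for illustration only.
hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
p, r = ngram_precision_recall(hyp, ref, n=1)
print(wer(hyp, ref), p, r, f_measure(p, r))
```

All four scores depend only on surface word overlap with the reference, which is why this family says nothing about whether the output is syntactically well formed.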


• Based on Syntax:

  – Recently developed

  – Focused on the syntax of the output sentence

  – Types:

    • Constituency Parsing

    • Dependency Parsing

    • Combination of both analyses (Liu & Gildea 2005)
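As a rough illustration of the dependency-based idea, the sketch below compares hypothesis and reference through the overlap of their (head, relation, dependent) triples. It is an assumption-laden toy: the triples are written by hand rather than produced by a parser, the relation labels and the function name `triple_f1` are hypothetical, and this is not the formulation used by Liu & Gildea (2005).

```python
from collections import Counter


def triple_f1(hyp_triples, ref_triples):
    """F1 over dependency triples shared by hypothesis and reference."""
    hyp_counts, ref_counts = Counter(hyp_triples), Counter(ref_triples)
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hand-written analyses (assumed, not parser output) for
# reference "the cat sat on the mat" and hypothesis "the cat sat the mat".
ref_deps = [("sat", "nsubj", "cat"), ("cat", "det", "the"),
            ("sat", "obl", "mat"), ("mat", "case", "on"),
            ("mat", "det", "the")]
hyp_deps = [("sat", "nsubj", "cat"), ("cat", "det", "the"),
            ("sat", "obj", "mat"), ("mat", "det", "the")]

print(triple_f1(hyp_deps, ref_deps))
```

Unlike a unigram match, the dropped preposition changes the relation between "sat" and "mat", so the structural difference is penalized even though almost every word is shared.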


• Based on Semantics:

  – Recently developed

  – Focused on the semantics of the output

  – Types:

    • NEs: Quality over NEs (NEE)

    • Semantic Roles: Similarities over Semantic Roles (SR)

Discussion of MT Evaluation Metrics

• Human Evaluation:

  – Advantages:

    • Allows overall quality to be evaluated

  – Disadvantages:

    • Time-consuming

    • Expensive

    • Subjective

Discussion of MT Evaluation Metrics

• Automatic Evaluation:

  – Advantages:

    • Fast

    • Not expensive

    • Objective

    • Updatable

  – Disadvantages?


• Automatic Metrics based on Lexical Matching:

  – Great advance in MT research in the last decade

  – Widely accepted & used by the SMT research community

  – BLEU is the most used automatic metric

  – Criticized by those not developing SMT systems

  – Usually depend on reference translations

  – Only take into account lexical similarities & disregard syntax

  – Biased


• Automatic Metrics based on Syntax:

  – Good improvement

  – Work at sentence level

  – Only focused on syntax

  – What about meaning?

• Automatic Metrics based on Semantics:

  – Good improvement

  – Only NEs & Semantic Roles

  – NEs not too relevant

  – Need further development

  – Only focused on meaning; what about syntax?


• Discussion of Automatic Metrics:

  – Each metric focuses on a partial aspect of quality

    • Strongly biased evaluations

    • Unfair comparison between systems

    • Overtuning of the system

  – Need for integration of metrics

    • Parametric vs. non-parametric

    • Evaluation of the quality of a metric combination:

      – Human likeness

      – Human acceptability
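One way to picture the integration issue is sketched below, under stated assumptions. The non-parametric option is taken here to mean a combination with no weights to tune (a uniform average of scores already scaled to [0, 1]), and human acceptability is read as correlation with human assessments, measured with Pearson's r. The function names and all numbers are invented for illustration and are not results.

```python
from math import sqrt


def uniform_combination(metric_scores):
    """Non-parametric combination: plain average, no weights to tune."""
    return sum(metric_scores) / len(metric_scores)


def pearson(xs, ys):
    """Pearson correlation; assumes both lists vary (non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Invented scores for four hypothetical systems:
# (lexical, syntactic, semantic) metric scores per system.
per_system_metrics = [(0.30, 0.50, 0.40), (0.45, 0.55, 0.50),
                      (0.40, 0.70, 0.65), (0.60, 0.80, 0.70)]
human_scores = [2.1, 2.8, 3.4, 4.0]  # invented adequacy judgments

combined = [uniform_combination(scores) for scores in per_system_metrics]
print(pearson(combined, human_scores))
```

A parametric combination would instead learn one weight per metric, which may fit human judgments better but raises the risk of tuning to one particular test bed.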

Hypothesis & Objective

• Hypothesis: Adding new linguistic information will improve the performance of Automatic Metrics

• Main Objective: Proposing a new Automatic Evaluation Metric based on linguistic information

Hypothesis & Objective

• Secondary Objectives:

  – Explore linguistic information:

    • Syntactic info: POS, shallow parsing, chunking, full parsing, dependency parsing, constituency parsing, etc.

    • Semantic info: Semantic Roles, semantic features, WordNet, FrameNet, Lexical Semantics, etc.

  – Look for linguistic resources suitable for computational processing

  – Look for publicly available linguistic resources

  – Explore the appropriate way to combine this information

Methodology & Schedule

• 4 stages:

  – Stage 1 (years 1 & 2):

    • Bibliography research and analysis:

      – Detailed exploration and analysis of Automatic Evaluation Metrics

      – Detailed exploration, analysis and selection of the adequate linguistic information

      – Exploration of the feasibility and availability of the linguistic resources needed

  – Stage 2 (years 1 & 2):

    • Selection of the evaluation corpus

Methodology & Schedule

  – Stage 3 (year 3):

    • Experiments on how to combine this linguistic information and the automatic evaluation metrics

    • Evaluation of our metric combination based on either human likeness or human acceptability

  – Stage 4 (year 4):

    • Analysis & discussion of the results obtained

    • Summary of the findings and reflection on the results obtained

    • Proposal of a new evaluation metric
