The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa

Article Structure

Abstract

A number of approaches to Automatic MT Evaluation based on deep linguistic knowledge have been suggested.

Introduction

Automatic evaluation methods based on similarity to human references have substantially accelerated the development cycle of many NLP tasks, such as Machine Translation, Automatic Summarization, Sentence Compression and Language Generation.

Previous Work on Machine Translation Meta-Evaluation

As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.

Metrics and Test Beds

3.1 Metric Set

Correlation with Human Judgements

4.1 Correlation at the Segment vs. System Levels

Alternatives to Correlation-based Meta-evaluation

We have seen that correlation with human judgements has serious limitations for metric evaluation.

Conclusions

Our experiments show that, on the one hand, traditional n-gram based metrics are at least as reliable as linguistic metrics for estimating translation quality at the segment level, for predicting significant improvements between systems, and for detecting poor and excellent translations.

Topics

n-gram

Appears in 23 sentences as: N-gram (3) n-gram (21)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. However, n-gram based metrics are still today the dominant approach.
    Page 1, “Abstract”
  2. However, the most commonly used metrics are still based on n-gram matching.
    Page 1, “Introduction”
  3. For that purpose, we compare — using four different test beds — the performance of 16 n-gram based metrics, 48 linguistic metrics and one combined metric from the state of the art.
    Page 1, “Introduction”
  4. All those metrics were based on n-gram overlap (a minimal sketch of n-gram overlap scoring follows this list).
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  5. Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics.
    Page 3, “Correlation with Human Judgements”
  6. Linguistic metrics are represented by grey plots, and black plots represent metrics based on n-gram overlap.
    Page 3, “Correlation with Human Judgements”
  7. Therefore, we need additional meta-evaluation criteria in order to clarify the behavior of linguistic metrics as compared to n-gram based metrics.
    Page 3, “Correlation with Human Judgements”
  8. [Figure 1 legend: Combining, N-gram, and Linguistic metrics]
    Page 3, “Correlation with Human Judgements”
  9. Therefore, we have focused on other aspects of metric reliability that have revealed differences between n-gram and linguistic based metrics:
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  10. All n-gram based metrics achieve SIP and SIR values between 0.8 and 0.9.
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
  11. This result suggests that n-gram based metrics are reasonably reliable for this purpose.
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
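
A note on what "n-gram overlap" means in these excerpts: the metric counts how many word n-grams of the candidate translation also appear in the human references. The fragment below is a minimal, illustrative Python sketch of clipped n-gram precision against a set of references; it is not the exact formulation of any metric used in the paper (BLEU, for example, additionally applies a brevity penalty and a geometric mean over n-gram orders), and the function names are ours.

    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_precision(hypothesis, references, max_order=4):
        # Clipped n-gram precision of a candidate against reference
        # translations, averaged over orders 1..max_order (illustrative only).
        hyp = hypothesis.split()
        refs = [r.split() for r in references]
        precisions = []
        for order in range(1, max_order + 1):
            hyp_counts = Counter(ngrams(hyp, order))
            if not hyp_counts:
                continue
            # A hypothesis n-gram is matched at most as many times as it
            # occurs in the single reference where it is most frequent.
            ref_max = Counter()
            for ref in refs:
                for gram, count in Counter(ngrams(ref, order)).items():
                    ref_max[gram] = max(ref_max[gram], count)
            matched = sum(min(c, ref_max[g]) for g, c in hyp_counts.items())
            precisions.append(matched / sum(hyp_counts.values()))
        return sum(precisions) / len(precisions) if precisions else 0.0

For example, ngram_precision("the cat sat on the mat", ["the cat is on the mat"]) rewards the shared unigrams and bigrams but is penalized by the unmatched higher-order n-grams around "sat".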

human judgements

Appears in 19 sentences as: Human judgements (1) human judgements (13) human judges (5)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
    Page 1, “Abstract”
  2. In this respect, we identify important drawbacks of the standard meta-evaluation methods based on correlation with human judgements.
    Page 1, “Introduction”
  3. In order to address this issue, they computed the translation-by-translation correlation with human judgements (i.e., correlation at the segment level); a sketch of segment- vs. system-level correlation follows this list.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  4. In all these cases, metrics were also evaluated by means of correlation with human judgements.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  5. Most approaches again rely on correlation with human judgements .
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  6. Human assessments of adequacy and fluency, on a 1-5 scale, are available for a subset of sentences, each evaluated by two different human judges .
    Page 3, “Metrics and Test Beds”
  7. Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics.
    Page 3, “Correlation with Human Judgements”
  8. Although correlation with human judgements is considered the standard meta-evaluation criterion, it presents serious drawbacks.
    Page 3, “Correlation with Human Judgements”
  9. For instance, Table 2 shows the best 10 metrics in CE05 according to their correlation with human judges at the system level, and then the ranking they obtain in the AE05 test bed.
    Page 3, “Correlation with Human Judgements”
  10. Table 2: Metrics rankings according to correlation with human judgements using CE05 vs. AE05
    Page 4, “Correlation with Human Judgements”
  11. Figure 2: Human judgements and scores of two hypothetical metrics with Pearson correlation 0.5
    Page 4, “Correlation with Human Judgements”
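
To make the distinction in these excerpts concrete, the sketch below computes Pearson correlation between metric scores and human judgements at both granularities: per segment (pooling all system-segment pairs) and per system (one averaged score per system). The data layout, the variable names and the use of scipy are assumptions of ours, not the paper's setup.

    import numpy as np
    from scipy.stats import pearsonr

    def segment_level_correlation(metric_scores, human_scores):
        # Both arguments: dict mapping system id -> list of per-segment scores.
        systems = sorted(metric_scores)
        m = np.concatenate([np.asarray(metric_scores[s], dtype=float) for s in systems])
        h = np.concatenate([np.asarray(human_scores[s], dtype=float) for s in systems])
        return pearsonr(m, h)[0]   # correlation over individual translations

    def system_level_correlation(metric_scores, human_scores):
        # One score per system (its average over segments), then Pearson.
        systems = sorted(metric_scores)
        m = [float(np.mean(metric_scores[s])) for s in systems]
        h = [float(np.mean(human_scores[s])) for s in systems]
        return pearsonr(m, h)[0]

A metric can rank whole systems reasonably well (high system-level correlation) while still being unreliable on individual segments, which is the contrast Figure 1 of the paper plots on its two axes.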

evaluation metrics

Appears in 12 sentences as: evaluation metric (4) evaluation metrics (8)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. In this work, we propose a novel approach for the meta-evaluation of MT evaluation metrics, since correlation coefficients against human judges do not reveal details about the advantages and disadvantages of particular metrics.
    Page 1, “Abstract”
  2. We then use this approach to investigate the benefits of introducing linguistic features into evaluation metrics.
    Page 1, “Abstract”
  3. These automatic evaluation metrics allow developers to optimize their systems without the need for expensive human assessments for each of their possible system configurations.
    Page 1, “Introduction”
  4. In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
    Page 1, “Introduction”
  5. Analyzing the reliability of evaluation metrics requires meta-evaluation criteria.
    Page 1, “Introduction”
  6. As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  7. Figure 1 shows the correlation obtained by each automatic evaluation metric at system level (horizontal axis) versus segment level (vertical axis) in our test beds.
    Page 3, “Correlation with Human Judgements”
  8. However, each automatic evaluation metric has its own scale properties.
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
  9. This conclusion motivates the incorporation of linguistic processing into automatic evaluation metrics.
    Page 6, “Alternatives to Correlation-based Meta-evaluation”
  10. In order to obtain additional evidence about the usefulness of combining evaluation metrics at different processing levels, let us consider the following situation: given a set of reference translations, we want to train a combined system that takes the most appropriate translation approach for each text segment (a sketch of this metric-guided selection follows this list).
    Page 7, “Alternatives to Correlation-based Meta-evaluation”
  11. predictive power of the employed automatic evaluation metric.
    Page 7, “Alternatives to Correlation-based Meta-evaluation”
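
The combined-system situation described in item 10 above can be read as metric-guided selection: for every segment, choose, among the pooled system outputs, the candidate that the metric scores highest against the references, and then judge the quality of the resulting combined output. The sketch below is our reading of that procedure; the data structures and the function name are hypothetical.

    def combine_by_metric(segment_candidates, references, metric):
        # segment_candidates: list over segments, each a list of candidate
        #                     translations (one per pooled system).
        # references:         list over segments of reference translation lists.
        # metric:             function(candidate, references) -> score.
        combined = []
        for candidates, refs in zip(segment_candidates, references):
            best = max(candidates, key=lambda c: metric(c, refs))
            combined.append(best)
        return combined

Under this reading, the quality of the combined output reflects the predictive power of the employed metric at the segment level: the better the metric discriminates good from bad candidates per segment, the better the selected translations should be.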

significant improvements

Appears in 12 sentences as: Significant Improvement (3) Significant improvement (1) significant improvement (3) significant improvements (6) significantly improves (1)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. Precision of Significant Improvement prediction
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  2. Recall of Significant Improvement
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  3. We now investigate to what extent a significant system improvement according to the metric implies a significant improvement according to human assessors, and vice versa.
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  4. In order to tackle this issue, we compare metrics versus human assessments in terms of precision and recall over statistically significant improvements within all system pairs in the test beds.
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  5. First, Table 3 shows the number of significant improvements over human judgements according to the Wilcoxon statistical significance test (α ≤ 0.025).
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  6. 45 system pairs; from these, in 40 cases (rightmost column) one of the systems significantly improves the other.
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  7. Based on these data, we define two meta-metrics: Significant Improvement Precision (SIP) and Significant Improvement Recall (SIR).
    Page 4, “Alternatives to Correlation-based Meta-evaluation”
  8. SIR (recall) represents to what extent the metric is able to cover the significant improvements detected by humans.
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
  9. Let I_h be the set of significant improvements detected by human assessors and I_m the set detected by the metric m; SIP and SIR are then defined as precision and recall over these sets (a hedged sketch follows this list).
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
  10. Given that linguistic metrics require matching translation with references at additional linguistic levels, the significant improvements detected are more reliable (higher precision or SIP), but at the cost of recall over real significant improvements (lower SIR).
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
  11. Notice that we just have 75 significant improvement samples, so small differences in SIP or SIR have no relevance.
    Page 5, “Alternatives to Correlation-based Meta-evaluation”
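
Putting items 5 to 9 together: significant improvements between system pairs are detected, both from human assessments and from a metric's scores, with a Wilcoxon test over paired segment scores, and SIP and SIR are then the precision and recall of the metric's detections with respect to the human ones, i.e. SIP = |I_h ∩ I_m| / |I_m| and SIR = |I_h ∩ I_m| / |I_h|. The Python sketch below follows that reading; the test configuration, data layout and function names are our assumptions, with scipy's wilcoxon standing in for whatever implementation the authors used.

    from itertools import combinations
    from scipy.stats import wilcoxon

    def significant_improvements(scores, alpha=0.025):
        # scores: dict mapping system id -> list of per-segment scores
        #         (same segments for every system).
        # Returns the set of ordered pairs (better, worse) whose difference
        # is significant under a Wilcoxon signed-rank test.
        detected = set()
        for a, b in combinations(sorted(scores), 2):
            _, p_value = wilcoxon(scores[a], scores[b])
            if p_value <= alpha:
                better, worse = (a, b) if sum(scores[a]) > sum(scores[b]) else (b, a)
                detected.add((better, worse))
        return detected

    def sip_sir(human_scores, metric_scores, alpha=0.025):
        # SIP: how many improvements claimed by the metric are confirmed by humans.
        # SIR: how many human-detected improvements the metric manages to cover.
        i_h = significant_improvements(human_scores, alpha)
        i_m = significant_improvements(metric_scores, alpha)
        overlap = len(i_h & i_m)
        sip = overlap / len(i_m) if i_m else 0.0
        sir = overlap / len(i_h) if i_h else 0.0
        return sip, sir

This is consistent with the behaviour described in item 10: a conservative metric that only flags improvements it can verify at deeper linguistic levels will tend to score high SIP but lower SIR.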

NIST

Appears in 7 sentences as: NIST (6) nist (1)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. With the aim of overcoming some of the deficiencies of BLEU, Doddington (2002) introduced the NIST metric.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  2. Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  3. At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
    Page 2, “Metrics and Test Beds”
  4. Table 1: NIST 2004/2005 MT Evaluation Campaigns.
    Page 3, “Metrics and Test Beds”
  5. We use the test beds from the 2004 and 2005 NIST MT Evaluation Campaigns (Le and Przybocki, 2005).
    Page 3, “Metrics and Test Beds”
  6. nist.
    Page 3, “Correlation with Human Judgements”
  7. NIST 5.70, randOST 5.20, minOST 3.67 [table row excerpt]
    Page 7, “Alternatives to Correlation-based Meta-evaluation”

BLEU

Appears in 5 sentences as: BLEU (5)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. Papineni et al. (2001) introduced the BLEU metric and evaluated its reliability in terms of Pearson correlation with human assessments for adequacy and fluency judgements.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  2. With the aim of overcoming some of the deficiencies of BLEU, Doddington (2002) introduced the NIST metric.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  3. Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”
  4. At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
    Page 2, “Metrics and Test Beds”
  5. We have studied 100 sentence evaluation cases from representatives of each metric family including: 1-PER, BLEU, DP-Or-⋆, GTM (e = 2), METEOR and ROUGE-L. The evaluation cases have been extracted from the four test beds.
    Page 6, “Alternatives to Correlation-based Meta-evaluation”

translation quality

Appears in 4 sentences as: translation quality (4)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. The main goal of our work is to analyze to what extent deep linguistic features can contribute to the automatic evaluation of translation quality.
    Page 1, “Introduction”
  2. In all cases, translation quality is measured by comparing automatic translations against a set of human references.
    Page 2, “Metrics and Test Beds”
  3. Figure 5: Maximum translation quality decreasing over similarly scored translation pairs.
    Page 7, “Alternatives to Correlation-based Meta-evaluation”
  4. reliable for estimating the translation quality at the segment level, for predicting significant improvement between systems and for detecting poor and excellent translations.
    Page 8, “Conclusions”

Machine Translation

Appears in 3 sentences as: Machine Translation (2) machine translation (1)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. Automatic evaluation methods based on similarity to human references have substantially accelerated the development cycle of many NLP tasks, such as Machine Translation, Automatic Summarization, Sentence Compression and Language Generation.
    Page 1, “Introduction”
  2. In the context of Machine Translation, a considerable effort has also been made to include deeper linguistic information in automatic evaluation metrics, both syntactic and semantic (see Section 2 for details).
    Page 1, “Introduction”
  3. As automatic evaluation metrics for machine translation have been proposed, different meta-evaluation frameworks have gradually been introduced.
    Page 2, “Previous Work on Machine Translation Meta-Evaluation”

translation system

Appears in 3 sentences as: translation system (2) translations system (1)
In The Contribution of Linguistic Features to Automatic Machine Translation Evaluation
  1. The translation system obviates some information which, in context, is not considered crucial by the human assessors.
    Page 6, “Alternatives to Correlation-based Meta-evaluation”
  2. We consider the set of translation systems presented in each competition as the pool of translation approaches.
    Page 7, “Alternatives to Correlation-based Meta-evaluation”
  3. In addition, our Combined System Test shows that, when training a combined translation system, using metrics at several linguistic processing levels substantially improves over the use of individual metrics.
    Page 8, “Conclusions”
