PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Chen, Boxing and Kuhn, Roland and Larkin, Samuel

Article Structure

Abstract

Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.

Introduction

Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems.

BLEU and PORT

First, define n-gram precision p(n) and recall r(n):

Experiments

3.1 PORT as an Evaluation Metric

Conclusions

In this paper, we have proposed a new tuning metric for SMT systems.

Topics

BLEU

Appears in 66 sentences as: BLEU (88)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU .
    Page 1, “Abstract”
  2. In principle, tuning on these metrics should yield better systems than tuning on BLEU .
    Page 1, “Abstract”
  3. It has a better correlation with human judgment than BLEU .
    Page 1, “Abstract”
  4. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
    Page 1, “Abstract”
  5. BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
    Page 1, “Introduction”
  6. Among these metrics, BLEU is the most widely used for both evaluation and tuning.
    Page 1, “Introduction”
  7. Many of the metrics correlate better with human judgments of translation quality than BLEU , as shown in recent WMT Evaluation Task reports (Callison-Burch et
    Page 1, “Introduction”
  8. However, BLEU remains the de facto standard tuning metric, for two reasons.
    Page 2, “Introduction”
  9. (2010) showed that BLEU tuning is more robust than tuning with other metrics (METEOR, TER, etc.
    Page 2, “Introduction”
  10. (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment.
    Page 2, “Introduction”
  11. In this work, our goal is to devise a metric that, like BLEU, is computationally cheap and language-independent, but that yields better MT systems than BLEU when used for tuning.
    Page 2, “Introduction”

word alignment

Appears in 15 sentences as: word alignment (15) word alignments (4)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. We use word alignment to compute the two permutations (LRscore also uses word alignment).
    Page 3, “BLEU and PORT” (see the permutation sketch after this list)
  2. The word alignment between the source input and reference is computed using GIZA++ (Och and Ney, 2003) beforehand with the default settings, then is refined with the heuristic grow-diag-final-and; the word alignment between the source input and the translation is generated by the decoder with the help of word alignment inside each phrase pair.
    Page 3, “BLEU and PORT”
  3. These encode one-to-one relations but not one-to-many, many-to-one, many-to-many or null relations, all of which can occur in word alignments.
    Page 3, “BLEU and PORT”
  4. Inspired by HMM word alignment (Vogel et al., 1996), our second distance measure is based on jump width.
    Page 3, “BLEU and PORT”
  5. The use of v means that unlike BLEU, PORT requires word alignment information.
    Page 4, “BLEU and PORT”
  6. In order to compute the v part of PORT, we require source-target word alignments for the references and MT outputs.
    Page 4, “Experiments”
  7. Also, v depends on source-target word alignments for reference and test sets.
    Page 4, “Experiments”
  8. 3.2.5 Robustness to word alignment errors
    Page 7, “Experiments”
  9. PORT, unlike BLEU, depends on word alignments.
    Page 7, “Experiments”
  10. How does quality of word alignment between source and reference affect PORT tuning?
    Page 7, “Experiments”
  11. We also ran GIZA++ to obtain its automatic word alignment, computed on CTB and FBIS.
    Page 7, “Experiments”
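
The ordering component described above operates on permutations of source positions read off from these word alignments. The Python sketch below is illustrative only: it assumes a one-to-one alignment given as (source position, target position) pairs, and its jump-width comparison and normalization are simplified stand-ins, not the paper's actual distance measures or its v.

    # Illustrative sketch, not the authors' code: derive a permutation of source
    # positions from a one-to-one word alignment, then compare two permutations with
    # a jump-width style distance (inspired by the jump-width idea in item 4 above).
    def permutation_from_alignment(alignment, src_len):
        """alignment: list of (src_pos, tgt_pos) pairs, assumed one-to-one."""
        order = [s for s, _ in sorted(alignment, key=lambda st: st[1])]
        # Convention assumed here: unaligned source words are appended in source order.
        order += [s for s in range(src_len) if s not in set(order)]
        return order

    def jump_width_score(perm_a, perm_b):
        """Similarity in [0, 1] based on how the two permutations' jump widths differ."""
        def jumps(perm):
            return [b - a for a, b in zip([-1] + perm[:-1], perm)]
        diff = sum(abs(x - y) for x, y in zip(jumps(perm_a), jumps(perm_b)))
        return 1.0 - diff / max(1, len(perm_a) ** 2)   # crude normalization (assumption)

    ref_perm = permutation_from_alignment([(0, 0), (2, 1), (1, 2)], src_len=3)
    out_perm = permutation_from_alignment([(0, 0), (1, 1), (2, 2)], src_len=3)
    print(ref_perm, out_perm, jump_width_score(ref_perm, out_perm))

In the paper, one permutation comes from the source-reference alignment and the other from the source-translation alignment produced by the decoder.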

evaluation metric

Appears in 12 sentences as: Evaluation Metric (2) evaluation metric (5) Evaluation metrics (1) evaluation metrics (4)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
    Page 1, “Abstract”
  2. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
    Page 1, “Abstract”
  3. Automatic evaluation metrics for machine translation (MT) quality are a key part of building statistical MT (SMT) systems.
    Page 1, “Introduction”
  4. MT Evaluation Metric for Tuning
    Page 1, “Introduction”
  5. These methods perform repeated decoding runs with different system parameter values, which are tuned to optimize the value of the evaluation metric over a development set with reference translations.
    Page 1, “Introduction” (a simplified tuning-loop sketch follows this list)
  6. MT evaluation metrics fall into three groups:
    Page 1, “Introduction”
  7. Several ordering measures have been integrated into MT evaluation metrics recently.
    Page 3, “BLEU and PORT”
  8. 3.1 PORT as an Evaluation Metric
    Page 4, “Experiments”
  9. We studied PORT as an evaluation metric on WMT data; test sets include WMT 2008, WMT 2009, and WMT 2010 all-to-English, plus 2009, 2010 English-to-all submissions.
    Page 4, “Experiments”
  10. This is because we designed PORT to carry out tuning; we did not optimize its performance as an evaluation metric, but rather, to optimize system tuning performance.
    Page 4, “Experiments”
  11. Evaluation metrics (%):
      Task         Tune  | BLEU   MTR    1-TER   PORT
      zh-en small  BLEU  | 26.8   55.2   38.0    49.7
                   PORT  | 27.2*  55.7   38.0    50.0
      zh-en large  BLEU  | 29.9   58.4   41.2    53.0
                   PORT  | 30.3*  59.0   42.0    53.2
      fr-en Hans   BLEU  | 38.8   69.8   54.2    57.1
                   PORT  | 38.8   69.6   54.6    57.1
      de-en WMT    BLEU  | 20.1   55.6   38.4    39.6
                   PORT  | 20.3   56.0   38.4    39.7
      en-de WMT    BLEU  | 13.6   43.3   30.1    31.7
                   PORT  | 13.6   43.3   30.7    31.7
    Page 5, “Experiments”
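
Item 5 above describes the outer loop of metric-driven tuning. The Python sketch below is a deliberately simplified stand-in for MERT (random search instead of MERT's line search): it rescores fixed n-best lists under sampled log-linear weights and keeps the weights that maximize an arbitrary corpus-level metric on the dev set. The data layout, the metric callback, and the feature names are assumptions for illustration.

    # Simplified stand-in for MERT-style tuning; not the algorithm used in the paper.
    import random

    def one_best(nbest, weights):
        """Pick the highest-scoring hypothesis per sentence under a log-linear model."""
        return [max(cands,
                    key=lambda c: sum(weights[f] * v for f, v in c["feats"].items()))["hyp"]
                for cands in nbest]

    def tune(nbest, refs, metric, n_trials=200, seed=0):
        """nbest: per-sentence lists of candidate dicts {"hyp": str, "feats": {name: value}}."""
        rng = random.Random(seed)
        feature_names = list(nbest[0][0]["feats"])
        best_weights, best_score = None, float("-inf")
        for _ in range(n_trials):
            weights = {f: rng.uniform(-1.0, 1.0) for f in feature_names}
            score = metric(one_best(nbest, weights), refs)  # e.g. corpus BLEU or PORT
            if score > best_score:
                best_weights, best_score = weights, score
        return best_weights, best_score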

human judgment

Appears in 12 sentences as: human judges (1) human judgment (8) human judgments (3)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU.
    Page 1, “Abstract”
  2. It has a better correlation with human judgment than BLEU.
    Page 1, “Abstract”
  3. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).
    Page 1, “Abstract”
  4. Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et
    Page 1, “Introduction”
  5. Second, though a tuning metric should correlate strongly with human judgment, MERT (and similar algorithms) invoke the chosen metric so often that it must be computed quickly.
    Page 2, “Introduction”
  6. (2011) claimed that TESLA tuning performed better than BLEU tuning according to human judgment.
    Page 2, “Introduction”
  7. Results given below show that PORT correlates better with human judgments of translation quality than BLEU does, and sometimes outperforms METEOR in this respect, based on data from WMT (2008-2010).
    Page 2, “Introduction”
  8. However, since PORT is designed for tuning, the most important results are those showing that PORT tuning yields systems with better translations than those produced by BLEU tuning — both as determined by automatic metrics (including BLEU), and according to human judgment, as applied to five data conditions involving four language pairs.
    Page 2, “Introduction”
  9. We used Spearman’s rank correlation coefficient ρ to measure correlation of the metric with system-level human judgments of translation.
    Page 4, “Experiments” (a correlation sketch follows this list)
  10. The human judgment score is based on the “Rank” only, i.e., how often the translations of the system were rated as better than those from other systems (Callison-Burch et al., 2008).
    Page 4, “Experiments”
  11. Table 2 (Correlations with human judgment on WMT): BLEU 0.792 / 0.215 / 0.777 / 0.240; METEOR 0.834 / 0.231 / 0.835 / 0.225; PORT 0.801 / 0.236 / 0.804 / 0.242
    Page 4, “Experiments”
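
The system-level correlation described in item 9 can be computed with an off-the-shelf routine; the numbers below are invented purely to show the call.

    # Minimal sketch with made-up numbers: Spearman correlation between a metric's
    # system-level scores and human "Rank" scores (fraction of wins), one per MT system.
    from scipy.stats import spearmanr

    metric_scores = [0.31, 0.28, 0.35, 0.22, 0.27]   # hypothetical metric score per system
    human_scores = [0.52, 0.47, 0.58, 0.30, 0.49]    # hypothetical human preference rate
    rho, _pvalue = spearmanr(metric_scores, human_scores)
    print(round(rho, 3))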

language pairs

Appears in 11 sentences as: language pair (2) language pairs (9)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs.
    Page 1, “Abstract”
  2. However, since PORT is designed for tuning, the most important results are those showing that PORT tuning yields systems with better translations than those produced by BLEU tuning — both as determined by automatic metrics (including BLEU), and according to human judgment, as applied to five data conditions involving four language pairs.
    Page 2, “Introduction”
  3. For our experiments, we tuned α on Chinese-English data, setting it to 0.25 and keeping this value for the other language pairs.
    Page 4, “BLEU and PORT”
  4. Most WMT submissions involve language pairs with similar word order, so the ordering factor v in PORT won’t play a big role.
    Page 4, “Experiments”
  5. In internal tests we have found no systematic difference in dev-set BLEUs, so we speculate that PORT’s emphasis on reordering yields models that generalize better for these two language pairs.
    Page 6, “Experiments”
  6. Of the Table 5 language pairs, the one where PORT tuning helps most has the lowest BLEU in Table 4 (German-English); the one where it helps least in Table 5 has the highest BLEU in Table 4 (French-English).
    Page 6, “Experiments”
  7. PORT differs from BLEU partly in modeling long-distance reordering more accurately; English and French have similar word order, but the other two language pairs don’t.
    Page 7, “Experiments”
  8. However, for the European language pairs, PORT and Qmean seem to be tied.
    Page 8, “Experiments”
  9. What would results be on that language pair if we were to replace v in PORT with another ordering measure?
    Page 8, “Experiments”
  10. Most important, our results show that PORT-tuned MT systems yield better translations than BLEU-tuned systems on several language pairs, according both to automatic metrics and human evaluations.
    Page 8, “Conclusions”
  11. In future work, we plan to tune the free parameter α for each language pair.
    Page 8, “Conclusions”

word ordering

Appears in 11 sentences as: word order (4) Word ordering (1) word ordering (7)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. This expression is then further combined with a new measure of word ordering, v, designed to reflect long-distance as well as short-distance word reordering (BLEU only reflects short-distance reordering).
    Page 2, “Introduction”
  2. Word ordering measures for MT compare two permutations of the original source-language word sequence: the permutation represented by the sequence of corresponding words in the MT output, and the permutation in the reference.
    Page 3, “BLEU and PORT”
  3. Qmean(N) of Eq. (10) and the word ordering measure v are combined in a harmonic mean: PORT = 2 / (1/Qmean(N) + 1/v^α)  (17). Here α is a free parameter that is tuned on held-out data.
    Page 4, “BLEU and PORT” (a sketch of this combination follows this list)
  4. Most WMT submissions involve language pairs with similar word order, so the ordering factor v in PORT won’t play a big role.
    Page 4, “Experiments”
  5. PORT differs from BLEU partly in modeling long-distance reordering more accurately; English and French have similar word order, but the other two language pairs don’t.
    Page 7, “Experiments”
  6. The results in section 3.3 (below) for Qmean, a version of PORT without word ordering factor v, suggest v may be defined suboptimally for French-English.
    Page 7, “Experiments”
  7. (18) for Chinese-English, making the influence of word ordering measure v in PORT too strong for the European pairs, which have similar word order.
    Page 8, “Experiments”
  8. A related question is how much word ordering improvement we obtained from tuning with PORT.
    Page 8, “Experiments”
  9. We evaluate Chinese-English word ordering with three measures: Spearman’s ρ, Kendall’s τ distance as applied to two permutations (see section 2.2.2) and our own measure v. Table 10 shows the effects of BLEU and PORT tuning on these three measures, for three test sets in the zh-en large condition.
    Page 8, “Experiments”
  10. From the table, we see that the PORT-tuned system yielded better word order than the BLEU-tuned system in all nine combinations of test sets and ordering measures.
    Page 8, “Experiments”
  11. PORT incorporates precision, recall, strict brevity penalty and strict redundancy penalty, plus a new word ordering measure v. As an evaluation metric, PORT performed better than BLEU at the system level and the segment level, and it was competitive with or slightly superior to METEOR at the segment level.
    Page 8, “Conclusions”
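
A minimal sketch of the combination in Eq. (17) as reconstructed in item 3 above: the harmonic mean of the precision/recall term Qmean(N) and the ordering measure v raised to the free parameter α (the paper reports tuning α to 0.25 on Chinese-English data). The input values below are made up.

    # Minimal sketch of PORT = 2 / (1/Qmean(N) + 1/v^alpha); not the authors' code.
    def port_score(qmean, v, alpha=0.25):
        v_a = v ** alpha
        if qmean == 0.0 or v_a == 0.0:
            return 0.0
        return 2.0 / (1.0 / qmean + 1.0 / v_a)

    print(port_score(qmean=0.50, v=0.80))   # hypothetical component values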

Chinese-English

Appears in 7 sentences as: Chinese-English (7)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. For our experiments, we tuned α on Chinese-English data, setting it to 0.25 and keeping this value for the other language pairs.
    Page 4, “BLEU and PORT”
  2. The large data condition uses training data from NIST 2009 (Chinese-English track).
    Page 5, “Experiments”
  3. We are currently investigating why PORT tuning gives higher BLEU scores than BLEU tuning for Chinese-English and German-English.
    Page 6, “Experiments”
  4. PORT outperforms Qmean on seven of the eight automatic scores shown for small and large Chinese-English.
    Page 7, “Experiments”
  5. (18) for Chinese-English, making the influence of word ordering measure v in PORT too strong for the European pairs, which have similar word order.
    Page 8, “Experiments”
  6. Measure v seems to help Chinese-English tuning.
    Page 8, “Experiments”
  7. We evaluate Chinese-English word ordering with three measures: Spearman’s ρ, Kendall’s τ distance as applied to two permutations (see section 2.2.2) and our own measure v. Table 10 shows the effects of BLEU and PORT tuning on these three measures, for three test sets in the zh-en large condition.
    Page 8, “Experiments”
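
Two of the three ordering measures named in item 7, Spearman's ρ and Kendall's τ distance, have textbook definitions over permutations; the Python sketch below uses those definitions and is not the authors' implementation.

    # Textbook Spearman's rho and Kendall's tau distance between two permutations
    # of the same source positions (reference order vs. MT-output order).
    from itertools import combinations

    def spearman_rho(perm_a, perm_b):
        n = len(perm_a)
        rank_b = {v: i for i, v in enumerate(perm_b)}
        d2 = sum((i - rank_b[v]) ** 2 for i, v in enumerate(perm_a))
        return 1.0 if n < 2 else 1.0 - 6.0 * d2 / (n * (n * n - 1))

    def kendall_tau_distance(perm_a, perm_b):
        """Fraction of position pairs ordered differently by the two permutations."""
        rank_b = {v: i for i, v in enumerate(perm_b)}
        pairs = list(combinations(perm_a, 2))
        discordant = sum(1 for x, y in pairs if rank_b[x] > rank_b[y])
        return discordant / len(pairs) if pairs else 0.0

    print(spearman_rho([0, 1, 2, 3], [0, 2, 1, 3]),
          kendall_tau_distance([0, 1, 2, 3], [0, 2, 1, 3]))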

LM

Appears in 5 sentences as: LM (6)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. The first is a 4-gram LM which is estimated on the target side of the texts used in the large data condition (below).
    Page 5, “Experiments”
  2. The second is a 5-gram LM estimated on English Gigaword.
    Page 5, “Experiments”
  3. We used two LMs in loglinear combination: a 4-gram LM trained on the target side of the parallel
    Page 5, “Experiments”
  4. LM.
    Page 5, “Experiments”
  5. The two conditions both use an LM trained on the target side of the parallel training data, and de-en also uses the English Gigaword 5-gram LM.
    Page 5, “Experiments”

MT systems

Appears in 5 sentences as: MT systems (5)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems.
    Page 1, “Abstract”
  2. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs.
    Page 1, “Abstract”
  3. First, there is no evidence that any other tuning metric yields better MT systems.
    Page 2, “Introduction”
  4. In this work, our goal is to devise a metric that, like BLEU, is computationally cheap and language-independent, but that yields better MT systems than BLEU when used for tuning.
    Page 2, “Introduction”
  5. Most important, our results show that PORT-tuned MT systems yield better translations than BLEU-tuned systems on several language pairs, according both to automatic metrics and human evaluations.
    Page 8, “Conclusions”

TER

Appears in 5 sentences as: TER (5)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
    Page 1, “Introduction”
  2. information; they are fast to compute (except TER).
    Page 1, “Introduction”
  3. (2010) showed that BLEU tuning is more robust than tuning with other metrics (METEOR, TER, etc.
    Page 2, “Introduction”
  4. We employed BLEU4, METEOR (V1.0), TER (v0.7.25), and the new metric PORT.
    Page 6, “Experiments”
  5. In the table, TER scores are presented as 1-TER to ensure that for all metrics, higher scores mean higher quality.
    Page 6, “Experiments”

n-gram

Appears in 4 sentences as: n-gram (4)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. First, define n-gram precision p(n) and recall r(n):
    Page 2, “BLEU and PORT”
  2. where Pg(N) is the geometric average of n-gram precisions
    Page 2, “BLEU and PORT”
  3. The average precision and average recall used in PORT (unlike those used in BLEU) are the arithmetic average of n-gram precisions Pa(N) and recalls Ra(N):
    Page 2, “BLEU and PORT” (a sketch of these averages follows this list)
  4. As usual, French-English is the outlier: the two outputs here are typically so similar that BLEU and Qmean tuning yield very similar n-gram statistics.
    Page 8, “Experiments”
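
A minimal sketch of the quantities in items 1-3: clipped n-gram precision p(n) and recall r(n) for one hypothesis/reference pair, the arithmetic averages Pa(N) and Ra(N) used by PORT, and the geometric average Pg(N) used by BLEU. In the paper these are corpus-level statistics with multiple references; the single-sentence, single-reference version here is only for illustration.

    # Minimal sketch (not the authors' code): clipped n-gram precision p(n) and
    # recall r(n), plus arithmetic averages Pa(N), Ra(N) and geometric average Pg(N).
    from collections import Counter
    from math import exp, log

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def precision_recall(hyp, ref, n):
        hyp_c, ref_c = ngram_counts(hyp, n), ngram_counts(ref, n)
        match = sum(min(c, ref_c[g]) for g, c in hyp_c.items())  # clipped matches
        p = match / max(sum(hyp_c.values()), 1)
        r = match / max(sum(ref_c.values()), 1)
        return p, r

    def averages(hyp, ref, N=4):
        ps, rs = zip(*(precision_recall(hyp, ref, n) for n in range(1, N + 1)))
        pa, ra = sum(ps) / N, sum(rs) / N                          # arithmetic (PORT)
        pg = exp(sum(log(p) for p in ps) / N) if all(ps) else 0.0  # geometric (BLEU)
        return pa, ra, pg

    hyp = "the cat sat on the mat".split()
    ref = "the cat is on the mat".split()
    print(averages(hyp, ref))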

n-grams

Appears in 4 sentences as: n-grams (5)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. translation hypothesis to compute the numbers of the reference n-grams.
    Page 2, “BLEU and PORT”
  2. Both BLEU and PORT perform matching of n-grams up to n = 4.
    Page 4, “Experiments”
  3. In all tuning experiments, both BLEU and PORT performed lower case matching of n-grams up to n = 4.
    Page 5, “Experiments”
  4. The BLEU-tuned and Qmean-tuned systems generate similar numbers of matching n-grams, but Qmean-tuned systems produce fewer n-grams (thus, shorter translations).
    Page 8, “Experiments”

NIST

Appears in 4 sentences as: NIST (5)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
    Page 1, “Introduction”
  2. The dev set comprised mainly data from the NIST 2005 test set, and also some balanced-genre web-text from NIST.
    Page 5, “Experiments”
  3. Evaluation was performed on NIST 2006 and 2008.
    Page 5, “Experiments”
  4. Table 10: Ordering scores (ρ, τ and v) for test sets NIST
    Page 8, “Experiments”

precision and recall

Appears in 3 sentences as: Precision and Recall (1) precision and recall (2)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. 2.2.1 Precision and Recall
    Page 2, “BLEU and PORT”
  2. To combine precision and recall, we tried four averaging methods: arithmetic (A), geometric (G), harmonic (H), and quadratic (Q) mean.
    Page 3, “BLEU and PORT”
  3. We chose the quadratic mean to combine precision and recall, as follows:
    Page 3, “BLEU and PORT”
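
A small sketch of the four averaging methods named in item 2 and the quadratic mean chosen in item 3; P and R stand for the average precision and recall, and the example values are made up.

    # The four candidate means for combining average precision P and recall R;
    # the paper combines them with the quadratic mean (its Qmean term).
    from math import sqrt

    def arithmetic(p, r): return (p + r) / 2
    def geometric(p, r):  return sqrt(p * r)
    def harmonic(p, r):   return 2 * p * r / (p + r) if p + r else 0.0
    def quadratic(p, r):  return sqrt((p * p + r * r) / 2)

    p, r = 0.45, 0.55
    for mean in (arithmetic, geometric, harmonic, quadratic):
        print(mean.__name__, round(mean(p, r), 4))

For unequal P and R these means are ordered harmonic < geometric < arithmetic < quadratic, so the quadratic mean penalizes a precision/recall imbalance the least.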

translation quality

Appears in 3 sentences as: translation quality (3)
In PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
  1. Many of the metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et
    Page 1, “Introduction”
  2. Results given below show that PORT correlates better with human judgments of translation quality than BLEU does, and sometimes outperforms METEOR in this respect, based on data from WMT (2008-2010).
    Page 2, “Introduction”
  3. Table 4 shows translation quality for BLEU- and PORT-tuned systems, as assessed by automatic metrics.
    Page 6, “Experiments”
