MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
Lo, Chi-kiu and Wu, Dekai

Article Structure

Abstract

We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost.

Topics

evaluation metric

Appears in 19 sentences as: evaluation metric (12) evaluation metrics (8)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
    Page 1, “Abstract”
  2. But more accurate, nonautomatic adequacy-oriented MT evaluation metrics like HTER are highly labor-intensive, which bottlenecks the evaluation cycle.
    Page 1, “Abstract”
  3. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
    Page 1, “Abstract”
  4. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER.
    Page 1, “Abstract”
  5. We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful.
    Page 1, “Abstract”
  6. As MT systems improve, the shortcomings of the n-gram based evaluation metrics are becoming more apparent.
    Page 1, “Abstract”
  7. We show empirically that our proposed SRL based evaluation metric, which uses untrained monolingual humans to annotate semantic frames in MT output, correlates with human adequacy judgments as well as HTER, and far better than BLEU and other commonly used metrics.
    Page 2, “Abstract”
  8. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  9. Both human and semiautomatic variants of the MEANT translation evaluation metric were meta-evaluated, as described next.
    Page 5, “Abstract”
  10. Table 3: Sentence-level correlation with human adequacy judgments, across the evaluation metrics.
    Page 6, “Abstract”
  11. Similarly, the time needed to run the evaluation metric is also significantly less than for HTER: at most 5 minutes per sentence, even for non-expert humans using no computer-assisted UI tools.
    Page 7, “Abstract”


semantic role

Appears in 15 sentences as: semantic role (8) semantic roles (8)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost.
    Page 1, “Abstract”
  2. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER.
    Page 1, “Abstract”
  3. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
    Page 1, “Abstract”
  4. We present the results of evaluating translation utility by measuring the accuracy within a semantic role labeling (SRL) framework.
    Page 2, “Abstract”
  5. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  6. Assuming for now that the metric aggregates ten types of semantic roles with uniform weight for each role (optimization of weights will be discussed later), then wpred = wj = 0.1, and so Cprecision and Crecall are both zero while Pprecision and Precall are both 0.5 (a worked sketch of this weighted aggregation follows this list).
    Page 5, “Abstract”
  7. Table 2: List of semantic roles that human judges are requested to label.
    Page 6, “Abstract”
  8. Readers’ language background thus affects their understanding of the translation, which could affect the accuracy of capturing the key semantic roles in the translation.
    Page 7, “Abstract”
  9. Both English monolinguals and Chinese-English bilinguals (Chinese as first language and English as second language) were employed to annotate the semantic roles.
    Page 7, “Abstract”
  10. One of the concerns of the proposed metric is that, given only minimal training on the task, humans would annotate the semantic roles so inconsistently as to reduce the reliability of the evaluation metric.
    Page 7, “Abstract”
  11. Role classification: The agreement of classified roles is counted over the matching of the semantic role labels within two aligned word spans.
    Page 8, “Abstract”
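
To make the uniform-weight example in item 6 concrete, here is a minimal illustrative sketch of weighted role-filler precision under uniform weights. It is an assumption-laden simplification, not the paper's exact MEANT formulas: the role inventory, the function and variable names (weighted_ratio, exact_matches, partial_matches), and the idea that precision is a weight-normalized ratio of matched to annotated fillers are stand-ins chosen only to reproduce the quoted values Cprecision = 0 and Pprecision = 0.5.

```python
# Minimal sketch (assumed simplification, not the paper's exact MEANT formulas):
# precision over semantic role fillers as a weight-normalized ratio of matched
# to annotated fillers, with uniform weights wpred = wj = 0.1.

def weighted_ratio(matched, annotated, weights):
    """Weighted fraction of annotated role fillers that were matched."""
    num = sum(weights[r] * matched.get(r, 0) for r in annotated)
    den = sum(weights[r] * count for r, count in annotated.items())
    return num / den if den else 0.0

# Ten role types (names are stand-ins), each with uniform weight 0.1.
roles = ["pred", "agent", "patient", "benefactive", "temporal",
         "locative", "purpose", "manner", "degree", "negation"]
weights = {r: 0.1 for r in roles}

annotated = {r: 1 for r in roles}              # one filler of each type in the MT output
exact_matches = {r: 0 for r in roles}          # nothing matches the reference exactly
partial_matches = {r: 1 for r in roles[:5]}    # half the weighted mass matches partially

c_precision = weighted_ratio(exact_matches, annotated, weights)    # 0.0
p_precision = weighted_ratio(partial_matches, annotated, weights)  # 0.5
print(c_precision, p_precision)
```

Recall would mirror this computation over the reference-side annotation, and precision and recall are then combined into an f-score; the exact aggregation used by MEANT is the one defined in the paper.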


human judgment

Appears in 9 sentences as: human judges (1) human judgment (9)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We introduce a novel semiautomated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost.
    Page 1, “Abstract”
  2. The results show that our proposed metric is significantly better correlated with human judgment on adequacy than current widespread automatic evaluation metrics, while being much more cost effective than HTER.
    Page 1, “Abstract”
  3. Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
    Page 1, “Abstract”
  4. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  5. Table 2: List of semantic roles that human judges are requested to label.
    Page 6, “Abstract”
  6. A correlation of 1 means the systems are ranked in the same order as the human judgment and -1 means the systems are ranked in the reverse order of the human judgment (a worked Kendall τ sketch follows this list).
    Page 6, “Abstract”
  7. We find that the correlation coefficient of the proposed metric with human judgment on adequacy drops when bilinguals are shown the source input sentence during annotation.
    Page 7, “Abstract”
  8. The correlation with human judgment on adequacy of the fully automated SRL annotation version of the SRL based evaluation metric, i.e., applying ASSERT to both the reference translation and the MT output, is about 80% of that of HTER.
    Page 9, “Abstract”
  9. The results also show that the correlation with human judgment on adequacy when automatic SRL is used on only one side of the translation is in the 85% to 95% range of that of HTER.
    Page 9, “Abstract”
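
Item 6 above gives the rank-correlation interpretation (1 means the metric ranks the systems in the same order as the human judgment, -1 the reverse order). As a small illustration of how such a coefficient can be computed, here is a self-contained Kendall τ-a over two score lists; the function name and the toy scores are hypothetical, and the paper's own meta-evaluation protocol may differ in detail.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists over the same items:
    (concordant pairs - discordant pairs) / total pairs."""
    assert len(scores_a) == len(scores_b) and len(scores_a) >= 2
    pairs = list(combinations(range(len(scores_a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        da = scores_a[i] - scores_a[j]
        db = scores_b[i] - scores_b[j]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Hypothetical metric scores vs. human adequacy scores for three MT outputs:
metric_scores = [0.62, 0.48, 0.55]
human_scores = [4, 2, 3]
print(kendall_tau(metric_scores, human_scores))  # 1.0: same ranking order
```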


BLEU

Appears in 8 sentences as: BLEU (8)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
    Page 1, “Abstract”
  2. We first show that when using untrained monolingual readers to annotate semantic roles in MT output, the nonautomatic version of the metric HMEANT achieves a 0.43 correlation coefficient with human adequacy judgments at the sentence level, far superior to BLEU at only 0.20, and equal to the far more expensive HTER.
    Page 1, “Abstract”
  3. We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful.
    Page 1, “Abstract”
  4. While the BLEU score performs well in capturing translation fluency, Callison-Burch et al.
    Page 1, “Abstract”
  5. Callison-Burch et al. (2006) and Koehn and Monz (2006) report cases where BLEU strongly disagrees with human judgment on translation quality.
    Page 1, “Abstract”
  6. We show empirically that our proposed SRL based evaluation metric, which uses untrained monolingual humans to annotate semantic frames in MT output, correlates with human adequacy judgments as well as HTER, and far better than BLEU and other commonly used metrics.
    Page 2, “Abstract”
  7. Metrics and Kendall τ scores: HMEANT 0.4324, HTER 0.4324, NIST 0.2883, BLEU 0.1982, METEOR 0.1982, TER 0.1982, PER 0.1982, CDER 0.1171, WER 0.0991
    Page 6, “Abstract”
  8. MEANT gold-auto* 0.3694, MEANT auto-auto* 0.3423, NIST 0.2883, BLEU / METEOR / TER / PER 0.1982, CDER 0.1171, WER 0.0991
    Page 9, “Abstract”


role labeling

Appears in 7 sentences as: role label (2) role labelers (1) role labeling (3) role labels (1)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We present the results of evaluating translation utility by measuring the accuracy within a semantic role labeling (SRL) framework.
    Page 2, “Abstract”
  2. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  3. Table 7: Inter-annotator agreement rate on role classification (matching of role label associated with matched word span)
    Page 8, “Abstract”
  4. Role classification: The agreement of classified roles is counted over the matching of the semantic role labels within two aligned word spans.
    Page 8, “Abstract”
  5. Mlabel denotes the number of annotated predicates and arguments with matching role labels between annotators (see the sketch after this list).
    Page 8, “Abstract”
  6. It is now worth asking a deeper question: can we further reduce the labor cost of MEANT by using automatic shallow semantic parsing instead of humans for semantic role labeling?
    Page 8, “Abstract”
  7. parser, we used ASSERT (Pradhan et al., 2004), which achieves roughly 87% semantic role labeling accuracy.
    Page 9, “Abstract”
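
Items 3-5 above describe counting role-classification agreement over word spans that both annotators aligned, with Mlabel the number of predicates and arguments whose role labels match. A minimal sketch of that counting follows; it assumes span alignment is already given, and the function name, the data layout, and the choice to normalize by the number of aligned spans are illustrative assumptions rather than the paper's exact definition.

```python
# Illustrative sketch (assumptions, not the paper's exact definition):
# role-classification agreement counted over word spans aligned between
# two annotators, normalized by the number of aligned spans.

def role_classification_agreement(ann_a, ann_b, aligned_spans):
    """ann_a, ann_b map a word span (start, end) to a role label.
    Returns (M_label, agreement rate), where M_label counts aligned spans
    whose role labels match between the two annotators."""
    if not aligned_spans:
        return 0, 0.0
    m_label = sum(1 for span in aligned_spans if ann_a[span] == ann_b[span])
    return m_label, m_label / len(aligned_spans)

# Hypothetical annotations over token-offset spans:
ann_a = {(0, 2): "agent", (3, 4): "pred", (5, 8): "patient"}
ann_b = {(0, 2): "agent", (3, 4): "pred", (5, 8): "locative"}
aligned = [(0, 2), (3, 4), (5, 8)]
print(role_classification_agreement(ann_a, ann_b, aligned))  # (2, 0.666...)
```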


semantic parsing

Appears in 5 sentences as: semantic parser (1) semantic parsers (1) semantic parsing (3)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semiautomated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure.
    Page 1, “Abstract”
  2. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  3. It is now worth asking a deeper question: can we further reduce the labor cost of MEANT by using automatic shallow semantic parsing instead of humans for semantic role labeling?
    Page 8, “Abstract”
  4. For SRL annotation, we replace humans with automatic shallow semantic parsing.
    Page 8, “Abstract”
  5. Another interesting investigation would then be to similarly replicate this analysis of the impact of each individual role, but using automatically rather than manually labeled semantic roles, in order to ascertain whether the more difficult semantic roles for automatic semantic parsers might also correspond to the less important aspects of end-to-end MT utility.
    Page 9, “Abstract”


semantic role labeling

Appears in 5 sentences as: semantic role labelers (1) semantic role labeling (3) semantic role labels (1)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We present the results of evaluating translation utility by measuring the accuracy within a semantic role labeling (SRL) framework.
    Page 2, “Abstract”
  2. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
  3. Role classification: The agreement of classified roles is counted over the matching of the semantic role labels within two aligned word spans.
    Page 8, “Abstract”
  4. It is now worth asking a deeper question: can we further reduce the labor cost of MEANT by using automatic shallow semantic parsing instead of humans for semantic role labeling?
    Page 8, “Abstract”
  5. parser, we used ASSERT (Pradhan et al., 2004), which achieves roughly 87% semantic role labeling accuracy.
    Page 9, “Abstract”


MT system

Appears in 4 sentences as: MT system (2) MT systems (2)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. As MT systems improve, the shortcomings of the n-gram based evaluation metrics are becoming more apparent.
    Page 1, “Abstract”
  2. State-of-the-art MT systems are often able to output fluent translations that are nearly grammatical and contain roughly the correct words, but still fail to express meaning that is close to the input.
    Page 1, “Abstract”
  3. Although HTER (Snover et al., 2006) is more adequacy-oriented, it is only employed in very large scale MT system evaluation instead of day-to-day research activities.
    Page 1, “Abstract”
  4. The human decisions should also be defined in a way that can be closely approximated by automatic methods, so that similar objective functions might potentially be used for tuning in MT system development cycles.
    Page 2, “Abstract”


n-gram

Appears in 4 sentences as: N-gram (1) n-gram (3)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
    Page 1, “Abstract”
  2. N-gram based metrics assume that “good” translations tend to share the same lexical choices as the reference translations (a toy sketch of this overlap assumption follows this list).
    Page 1, “Abstract”
  3. As MT systems improve, the shortcomings of the n-gram based evaluation metrics are becoming more apparent.
    Page 1, “Abstract”
  4. Finally, we show that replacing the human semantic role labelers with an automatic shallow semantic parser in our proposed metric yields an approximation that is about 80% as closely correlated with human judgment as HTER, at an even lower cost—and is still far better correlated than n-gram based evaluation metrics.
    Page 2, “Abstract”
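
As a toy illustration of the lexical-overlap assumption in item 2, here is a clipped (modified) n-gram precision against a single reference, in the spirit of BLEU. It is a simplification for illustration only: real BLEU combines several n-gram orders with a geometric mean and a brevity penalty, and the function names here are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(modified_ngram_precision(hyp, ref, 1))  # 4/6: most unigrams are shared
print(modified_ngram_precision(hyp, ref, 2))  # 2/5: fewer bigrams are shared
```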


human annotators

Appears in 3 sentences as: human annotators (5)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. We study the cost/benefit tradeoff of using human annotators from different language backgrounds for the proposed evaluation metric, and compare whether providing the original source text helps.
    Page 7, “Abstract”
  2. The correlation coefficient of the SRL based evaluation metric driven by bilingual human annotators (0.351) is slightly better than that driven by monolingual human annotators (0.315); however, using bilinguals in the evaluation process is more costly than using monolinguals.
    Page 7, “Abstract”
  3. The correlation coefficient of the SRL based evaluation metric driven by bilingual human annotators who also see the source input sentences is 0.315, which is the same as that driven by monolingual human annotators.
    Page 7, “Abstract”


machine translation

Appears in 3 sentences as: machine translation (3)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
    Page 1, “Abstract”
  2. We argue that BLEU (Papineni et al., 2002) and other automatic n- gram based MT evaluation metrics do not adequately capture the similarity in meaning between the machine translation and the reference translation—which, ultimately, is essential for MT output to be useful.
    Page 1, “Abstract”
  3. Is the most essential semantic information being captured by machine translation systems?
    Page 2, “Abstract”


Sentence-level

Appears in 3 sentences as: Sentence-level (3)
In MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles
  1. Table 3: Sentence-level correlation with human adequacy judgments, across the evaluation metrics.
    Page 6, “Abstract”
  2. Table 5: Sentence-level correlation with human adequacy judgments, for monolinguals vs. bilinguals.
    Page 7, “Abstract”
  3. Table 8: Sentence-level correlation with human adequacy judgments.
    Page 9, “Abstract”
