XMEANT: Better semantic MT evaluation without reference translations
Lo, Chi-kiu and Beloucif, Meriem and Saers, Markus and Wu, Dekai

Article Structure

Abstract

We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.

Introduction

We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.

Related Work

2.1 MT evaluation metrics

XMEANT: a cross-lingual MEANT

Like MEANT, XMEANT aims to evaluate how well MT preserves the core semantics, while maintaining full representational transparency.

Results

Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.

Conclusion

We have presented XMEANT, a new cross-lingual variant of MEANT that correlates even more closely with human translation adequacy judgments than MEANT, without the expensive human references.

Acknowledgments

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under BOLT contract nos.

Topics

cross-lingual

Appears in 18 sentences as: Cross-lingual (1) cross-lingual (17)
In XMEANT: Better semantic MT evaluation without reference translations
  1. We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
    Page 1, “Abstract”
  2. However, to go beyond tuning weights in the loglinear SMT model, a cross-lingual objective function that can deeply integrate semantic frame criteria into the MT training pipeline is needed.
    Page 1, “Abstract”
  3. We show that cross-lingual XMEANT outperforms monolingual MEANT by (1) replacing the monolingual context vector model in MEANT with simple translation probabilities, and (2) incorporating bracketing ITG constraints.
    Page 1, “Abstract”
  4. We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.
    Page 1, “Introduction”
  5. Our results suggest that MT translation adequacy is more accurately evaluated via the cross-lingual semantic frame similarities of the input and the MT output, which may obviate the need for expensive human reference translations.
    Page 1, “Introduction”
  6. In order to continue driving MT towards better translation adequacy by deeply integrating semantic frame criteria into the MT training pipeline, it is necessary to have a cross-lingual semantic objective function that assesses the semantic frame similarities of input and output sentences.
    Page 1, “Introduction”
  7. We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
    Page 1, “Introduction”
  8. Evaluating cross-lingual MT quality is similar to the task of MT quality estimation (QE).
    Page 3, “Related Work”
  9. Figure 3: Cross-lingual XMEANT algorithm.
    Page 4, “Related Work”
  10. But whereas MEANT measures lexical similarity using a monolingual context vector model, XMEANT instead substitutes simple cross-lingual lexical translation probabilities.
    Page 4, “XMEANT: a cross-lingual MEANT”
  11. To aggregate individual lexical translation probabilities into phrasal similarities between cross-lingual semantic role fillers, we compared two natural approaches to generalizing MEANT’s method of comparing semantic parses, as described below.
    Page 4, “XMEANT: a cross-lingual MEANT”

See all papers in Proc. ACL 2014 that mention cross-lingual.


semantic role

Appears in 17 sentences as: semantic role (19)
In XMEANT: Better semantic MT evaluation without reference translations
  1. XMEANT is obtained by (1) using simple lexical translation probabilities, instead of the monolingual context vector model used in MEANT for computing the semantic role filler similarities, and (2) incorporating bracketing ITG constraints for word alignment within the semantic role fillers.
    Page 1, “Introduction”
  2. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  3. MEANT is easily portable to other languages, requiring only an automatic semantic parser and a large monolingual corpus in the output language for identifying the semantic structures and the lexical similarity between the semantic role fillers of the reference and translation.
    Page 2, “Related Work”
  4. There are a total of 12 weights for the set of semantic role labels in MEANT, as defined in Lo and Wu (2011b).
    Page 3, “Related Work”
  5. For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references; thus UMEANT is useful when human judgments on the adequacy of the development set are unavailable.
    Page 3, “Related Work”
  6. (2012) described how the lexical and phrasal similarities of the semantic role fillers are computed.
    Page 3, “Related Work”
  7. In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions.
    Page 3, “Related Work”
  8. The weights can also be estimated in an unsupervised fashion using the relative frequency of each semantic role label in the foreign input, as in UMEANT.
    Page 4, “XMEANT: a cross-lingual MEANT”
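The relative-frequency estimation described above is simple enough to sketch. A minimal illustration (the function name is hypothetical; it assumes one label per observed role filler in the foreign input):

```python
# Sketch of UMEANT-style unsupervised weight estimation: each role
# label's weight is its relative frequency among observed role fillers.
from collections import Counter

def estimate_role_weights(role_labels):
    """role_labels: semantic role labels observed in the foreign input,
    one entry per labeled role filler."""
    counts = Counter(role_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

w = estimate_role_weights(["ARG0", "ARG1", "ARG1", "ARGM-TMP"])
print(w["ARG1"])  # 0.5
```

The weights sum to one by construction, which makes them directly usable in a weighted f-score without further normalization.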
  9. To aggregate individual lexical translation probabilities into phrasal similarities between cross-lingual semantic role fillers, we compared two natural approaches to generalizing MEANT’s method of comparing semantic parses, as described below.
    Page 4, “XMEANT: a cross-lingual MEANT”
  10. 3.1 Applying MEANT’s f-score within semantic role fillers
    Page 4, “XMEANT: a cross-lingual MEANT”
  11. The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregating lexical translation probabilities within semantic role filler phrases.
    Page 4, “XMEANT: a cross-lingual MEANT”


translation probabilities

Appears in 11 sentences as: translation probabilities (10) translation probability (1)
In XMEANT: Better semantic MT evaluation without reference translations
  1. We show that cross-lingual XMEANT outperforms monolingual MEANT by (1) replacing the monolingual context vector model in MEANT with simple translation probabilities, and (2) incorporating bracketing ITG constraints.
    Page 1, “Abstract”
  2. XMEANT is obtained by (1) using simple lexical translation probabilities, instead of the monolingual context vector model used in MEANT for computing the semantic role filler similarities, and (2) incorporating bracketing ITG constraints for word alignment within the semantic role fillers.
    Page 1, “Introduction”
  3. We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
    Page 1, “Introduction”
  4. Apply the maximum weighted bipartite matching algorithm to align the semantic frames between the foreign input and MT output according to the lexical translation probabilities of the predicates.
    Page 4, “Related Work”
  5. For each pair of the aligned frames, apply the maximum weighted bipartite matching algorithm to align the arguments between the foreign input and MT output according to the aggregated phrasal translation probabilities of the role fillers.
    Page 4, “Related Work”
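The two alignment steps quoted above both reduce to maximum weighted bipartite matching. A minimal sketch of the predicate-alignment step, with hypothetical names and a brute-force search standing in for the usual Hungarian algorithm (sentences contain few frames, so exhaustive search is adequate for illustration):

```python
# Align input-side predicates to output-side predicates so that the
# total lexical translation probability is maximized (maximum weighted
# bipartite matching, done by brute force over assignments here).
from itertools import permutations

def align_frames(in_preds, out_preds, p_trans):
    """Return (input_index, output_index) pairs; None slots allow a
    predicate to remain unaligned rather than take a zero-score match."""
    slots = list(range(len(out_preds))) + [None] * len(in_preds)
    best, best_score = [], -1.0
    for perm in set(permutations(slots, len(in_preds))):
        score = sum(p_trans.get((f, out_preds[j]), 0.0)
                    for f, j in zip(in_preds, perm) if j is not None)
        if score > best_score:
            best_score = score
            best = [(i, j) for i, (f, j) in enumerate(zip(in_preds, perm))
                    if j is not None
                    and p_trans.get((f, out_preds[j]), 0.0) > 0]
    return best

p = {("买", "bought"): 0.7, ("说", "said"): 0.6}
print(align_frames(["买", "说"], ["said", "bought"], p))  # [(0, 1), (1, 0)]
```

The same routine applies at the argument level by swapping in the aggregated phrasal translation scores of the role fillers as the edge weights.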
  6. But whereas MEANT measures lexical similarity using a monolingual context vector model, XMEANT instead substitutes simple cross-lingual lexical translation probabilities.
    Page 4, “XMEANT: a cross-lingual MEANT”
  7. To aggregate individual lexical translation probabilities into phrasal similarities between cross-lingual semantic role fillers, we compared two natural approaches to generalizing MEANT’s method of comparing semantic parses, as described below.
    Page 4, “XMEANT: a cross-lingual MEANT”
  8. The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregating lexical translation probabilities within semantic role filler phrases.
    Page 4, “XMEANT: a cross-lingual MEANT”
  9. We therefore relax the assumption, and thus for cross-lingual phrasal precision/recall, we align each token of the role fillers in the output/input string to the token of the role fillers in the input/output string that has the maximum lexical translation probability.
    Page 4, “XMEANT: a cross-lingual MEANT”
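The relaxed max-probability alignment and the f-score aggregation can be combined in a few lines. An illustrative sketch (hypothetical names; the exact normalization inside XMEANT may differ):

```python
# Cross-lingual phrasal similarity of a role-filler pair: each token on
# one side is scored by its best-translating token on the other side;
# precision and recall average those maxima, combined as an f-score.
def phrasal_fscore(in_tokens, out_tokens, p_trans):
    if not in_tokens or not out_tokens:
        return 0.0
    # precision: score each output token by its best input-side match
    prec = sum(max(p_trans.get((f, e), 0.0) for f in in_tokens)
               for e in out_tokens) / len(out_tokens)
    # recall: score each input token by its best output-side match
    rec = sum(max(p_trans.get((f, e), 0.0) for e in out_tokens)
              for f in in_tokens) / len(in_tokens)
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

p = {("红", "red"): 0.8, ("车", "car"): 0.9}
print(phrasal_fscore(["红", "车"], ["red", "car"], p))  # ≈ 0.85
```

Because each token takes its single best counterpart, the score degrades gracefully when a role filler is only partially translated.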
  10. The second natural approach is to extend MEANT’s ITG bias on compositional reordering, so as to also apply to aggregating lexical translation probabilities within semantic role filler phrases.
    Page 4, “XMEANT: a cross-lingual MEANT”
  11. This is (1) accomplished by replacing monolingual MEANT’s context vector model with simple translation probabilities when computing similarities of semantic role fillers, and (2) further improved by incorporating BITG constraints for aligning the tokens in semantic role fillers.
    Page 5, “Conclusion”


BLEU

Appears in 6 sentences as: BLEU (6)
In XMEANT: Better semantic MT evaluation without reference translations
  1. In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b).
    Page 1, “Introduction”
  2. Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
    Page 2, “Related Work”
  3. In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
    Page 2, “Related Work”
  4. TINE (Rios et al., 2011) is a recall-oriented metric which aims to preserve the basic event structure, but it performs comparably to BLEU and worse than METEOR on correlation with human adequacy judgments.
    Page 2, “Related Work”
  5. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  6. Tuning MT systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text and informal public speech.
    Page 3, “Related Work”


f-score

Appears in 6 sentences as: f-score (6)
In XMEANT: Better semantic MT evaluation without reference translations
  1. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  2. In this paper, we employ a newer version of MEANT that uses f-score to aggregate individual token similarities into the composite phrasal similarities of semantic role fillers, as our experiments indicate this is more accurate than the previously used aggregation functions.
    Page 3, “Related Work”
  3. Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers, according to definitions similar to those in section 2.2 except for replacing REF with IN in q_ij and w_il.
    Page 4, “Related Work”
  4. 3.1 Applying MEANT’s f-score within semantic role fillers
    Page 4, “XMEANT: a cross-lingual MEANT”
  5. The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregating lexical translation probabilities within semantic role filler phrases.
    Page 4, “XMEANT: a cross-lingual MEANT”
  6. Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
    Page 5, “Results”


semantic parser

Appears in 6 sentences as: semantic parse (1) semantic parser (4) semantic parses (2)
In XMEANT: Better semantic MT evaluation without reference translations
  1. MEANT is easily portable to other languages, requiring only an automatic semantic parser and a large monolingual corpus in the output language for identifying the semantic structures and the lexical similarity between the semantic role fillers of the reference and translation.
    Page 2, “Related Work”
  2. Apply an input language automatic shallow semantic parser to the foreign input and an output language automatic shallow semantic parser to the MT output.
    Page 4, “Related Work”
  3. Figure 2 shows examples of automatic shallow semantic parses on both foreign input and MT output.
    Page 4, “Related Work”
  4. The Chinese semantic parser used in our experiments is C-ASSERT (Fung et al., 2004, 2007).
    Page 4, “Related Work”
  5. To aggregate individual lexical translation probabilities into phrasal similarities between cross-lingual semantic role fillers, we compared two natural approaches to generalizing MEANT’s method of comparing semantic parses, as described below.
    Page 4, “XMEANT: a cross-lingual MEANT”
  6. The first natural approach is to extend MEANT’s f-score based method of aggregating semantic parse accuracy, so as to also apply to aggregating lexical translation probabilities within semantic role filler phrases.
    Page 4, “XMEANT: a cross-lingual MEANT”


evaluation metrics

Appears in 5 sentences as: evaluation metric (2) evaluation metrics (3)
In XMEANT: Better semantic MT evaluation without reference translations
  1. We introduce XMEANT—a new cross-lingual version of the semantic frame based MT evaluation metric MEANT—which can correlate even more closely with human adequacy judgments than monolingual MEANT and eliminates the need for expensive human references.
    Page 1, “Abstract”
  2. It is well established that the MEANT family of metrics correlates better with human adequacy judgments than commonly used MT evaluation metrics (Lo and Wu, 2011a, 2012; Lo et al., 2012; Lo and Wu, 2013b; Machacek and Bojar, 2013).
    Page 1, “Introduction”
  3. We therefore propose XMEANT, a cross-lingual MT evaluation metric that modifies MEANT using (1) simple translation probabilities (in our experiments,
    Page 1, “Introduction”
  4. 2.1 MT evaluation metrics
    Page 2, “Related Work”
  5. Table 1 shows that for human adequacy judgments at the sentence level, the f-score based XMEANT (1) correlates significantly more closely than other commonly used monolingual automatic MT evaluation metrics, and (2) even correlates nearly as well as monolingual MEANT.
    Page 5, “Results”


human judgement

Appears in 5 sentences as: human judgement (2) human judgment (2) human judgments (2)
In XMEANT: Better semantic MT evaluation without reference translations
  1. We show that XMEANT, a new cross-lingual version of MEANT (Lo et al., 2012), correlates with human judgment even more closely than MEANT for evaluating MT adequacy via semantic frames, despite discarding the need for expensive human reference translations.
    Page 1, “Introduction”
  2. In fact, a number of large scale meta-evaluations (Callison-Burch et al., 2006; Koehn and Monz, 2006) report cases where BLEU strongly disagrees with human judgments of translation adequacy.
    Page 2, “Related Work”
  3. ULC (Gimenez and Marquez, 2007, 2008) incorporates several semantic features and shows improved correlation with human judgement on translation quality (Callison-Burch et al., 2007, 2008), but no work has been done towards tuning an SMT system using a pure form of ULC, perhaps due to its expensive run time.
    Page 2, “Related Work”
  4. For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references; thus UMEANT is useful when human judgments on the adequacy of the development set are unavailable.
    Page 3, “Related Work”
  5. To address this problem, Quirk (2004) related the sentence-level correctness of the QE model to human judgment and achieved a high correlation with human judgement for a small annotated corpus; however, the proposed model does not scale well to larger data sets.
    Page 3, “Related Work”


role labels

Appears in 5 sentences as: role label (2) role labels (3)
In XMEANT: Better semantic MT evaluation without reference translations
  1. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  2. There are a total of 12 weights for the set of semantic role labels in MEANT, as defined in Lo and Wu (2011b).
    Page 3, “Related Work”
  3. For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references; thus UMEANT is useful when human judgments on the adequacy of the development set are unavailable.
    Page 3, “Related Work”
  4. Compute the weighted f-score over the matching role labels of these aligned predicates and role fillers, according to definitions similar to those in section 2.2 except for replacing REF with IN in q_ij and w_il.
    Page 4, “Related Work”
  5. The weights can also be estimated in an unsupervised fashion using the relative frequency of each semantic role label in the foreign input, as in UMEANT.
    Page 4, “XMEANT: a cross-lingual MEANT”


semantic role label

Appears in 4 sentences as: semantic role label (2) semantic role labels (2)
In XMEANT: Better semantic MT evaluation without reference translations
  1. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  2. There are a total of 12 weights for the set of semantic role labels in MEANT, as defined in Lo and Wu (2011b).
    Page 3, “Related Work”
  3. For UMEANT (Lo and Wu, 2012), they are estimated in an unsupervised manner using the relative frequency of each semantic role label in the references; thus UMEANT is useful when human judgments on the adequacy of the development set are unavailable.
    Page 3, “Related Work”
  4. The weights can also be estimated in an unsupervised fashion using the relative frequency of each semantic role label in the foreign input, as in UMEANT.
    Page 4, “XMEANT: a cross-lingual MEANT”


TER

Appears in 4 sentences as: TER (4)
In XMEANT: Better semantic MT evaluation without reference translations
  1. In addition, the translation adequacy across different genres (ranging from formal news to informal web forum and public speech) and different languages (English and Chinese) is improved by replacing BLEU or TER with MEANT during parameter tuning (Lo et al., 2013a; Lo and Wu, 2013a; Lo et al., 2013b).
    Page 1, “Introduction”
  2. Surface-form oriented metrics such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), CDER (Leusch et al., 2006), WER (Nießen et al., 2000), and TER (Snover et al., 2006) do not correctly reflect the meaning similarities of the input sentence.
    Page 2, “Related Work”
  3. MEANT (Lo et al., 2012), the weighted f-score over the matched semantic role labels of the automatically aligned semantic frames and role fillers, outperforms BLEU, NIST, METEOR, WER, CDER and TER in correlation with human adequacy judgments.
    Page 2, “Related Work”
  4. Tuning MT systems against MEANT produces more robustly adequate translations than the common practice of tuning against BLEU or TER across different data genres, such as formal newswire text, informal web forum text and informal public speech.
    Page 3, “Related Work”


objective function

Appears in 3 sentences as: objective function (3)
In XMEANT: Better semantic MT evaluation without reference translations
  1. However, to go beyond tuning weights in the loglinear SMT model, a cross-lingual objective function that can deeply integrate semantic frame criteria into the MT training pipeline is needed.
    Page 1, “Abstract”
  2. In order to continue driving MT towards better translation adequacy by deeply integrating semantic frame criteria into the MT training pipeline, it is necessary to have a cross-lingual semantic objective function that assesses the semantic frame similarities of input and output sentences.
    Page 1, “Introduction”
  3. While monolingual MEANT alone accurately reflects adequacy via semantic frames and optimizing SMT against MEANT improves translation, the new cross-lingual XMEANT semantic objective function moves closer toward deep integration of semantics into the MT training pipeline.
    Page 5, “Conclusion”


sentence-level

Appears in 3 sentences as: Sentence-level (1) sentence-level (2)
In XMEANT: Better semantic MT evaluation without reference translations
  1. (2004) introduced a sentence-level QE system where an arbitrary threshold is used to classify the MT output as good or bad.
    Page 3, “Related Work”
  2. To address this problem, Quirk (2004) related the sentence-level correctness of the QE model to human judgment and achieved a high correlation with human judgement for a small annotated corpus; however, the proposed model does not scale well to larger data sets.
    Page 3, “Related Work”
  3. Table 1: Sentence-level correlation with HAJ
    Page 5, “Results”


word alignment

Appears in 3 sentences as: word alignment (3)
In XMEANT: Better semantic MT evaluation without reference translations
  1. XMEANT is obtained by (1) using simple lexical translation probabilities, instead of the monolingual context vector model used in MEANT for computing the semantic role filler similarities, and (2) incorporating bracketing ITG constraints for word alignment within the semantic role fillers.
    Page 1, “Introduction”
  2. than that of the reference translation, and on the other hand, the BITG constrains the word alignment more accurately than the heuristic bag-of-word aggregation used in MEANT.
    Page 1, “Introduction”
  3. It is also consistent with results observed while estimating word alignment probabilities, where BITG constraints outperformed alignments from GIZA++ (Saers and Wu, 2009).
    Page 5, “Results”
