Using Discourse Structure Improves Machine Translation Evaluation
Guzmán, Francisco and Joty, Shafiq and Màrquez, Lluís and Nakov, Preslav

Article Structure

Abstract

We present experiments in using discourse structure for improving machine translation evaluation.

Introduction

From its foundations, Statistical Machine Translation (SMT) had two defining characteristics: first, translation was modeled as a generative process at the sentence-level.

Related Work

Addressing discourse-level phenomena in machine translation is relatively new as a research direction.

Our Discourse-Based Measures

Our working hypothesis is that the similarity between the discourse structures of an automatic and of a reference translation provides additional information that can be valuable for evaluating MT systems.

Experimental Setup

In our experiments, we used the data available for the WMT12 and the WMT11 metrics shared tasks for translations into English. This included the output from the systems that participated in the WMT12 and the WMT11 MT evaluation campaigns, both consisting of 3,003 sentences, for four different language pairs: Czech-English (CS-EN), French-English (FR-EN), German-English (DE-EN), and Spanish-English (ES-EN); as well as a dataset with the English references.

Experimental Results

In this section, we explore how discourse information can be used to improve machine translation evaluation metrics.

Conclusions and Future Work

In this paper we have shown that discourse structure can be used to improve automatic MT evaluation.

Topics

evaluation metrics

Appears in 21 sentences as: evaluation metric (2) Evaluation Metrics (1) evaluation metrics (20)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level.
    Page 1, “Abstract”
  2. Rather than proposing a single new metric, we show that discourse information is complementary to the state-of-the-art evaluation metrics, and thus should be taken into account in the development of future richer evaluation metrics.
    Page 1, “Abstract”
  3. We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics.
    Page 2, “Introduction”
  4. In this paper, rather than proposing yet another MT evaluation metric, we show that discourse information is complementary to many existing evaluation metrics, and thus should not be ignored.
    Page 2, “Introduction”
  5. We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
    Page 2, “Introduction”
  6. A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
    Page 2, “Related Work”
  7. Thus, there is consensus that discourse-informed MT evaluation metrics are needed in order to advance research in this direction.
    Page 2, “Related Work”
  8. The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
    Page 2, “Related Work”
  9. In this work, instead of proposing a new metric, we focus on enriching current MT evaluation metrics with discourse information.
    Page 2, “Related Work”
  10. In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
    Page 3, “Our Discourse-Based Measures”
  11. 4.1 MT Evaluation Metrics
    Page 4, “Experimental Setup”
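
The excerpts above describe enriching existing metrics with discourse information rather than replacing them. Below is a minimal sketch of that idea, assuming both component scores are already normalized to [0, 1]; the helper names base_metric and discourse_similarity and the weight alpha are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch: enriching an existing segment-level metric with a
# discourse-similarity score via simple linear interpolation.
# `base_metric` stands in for any lexical metric (e.g., a BLEU-like score);
# `discourse_similarity` stands in for a tree-kernel similarity between the
# discourse trees of the hypothesis and the reference.

def combined_score(hypothesis, reference, base_metric, discourse_similarity,
                   alpha=0.5):
    """Interpolate a base MT metric with a discourse-aware similarity.

    Both component scores are assumed to be normalized to [0, 1].
    """
    lexical = base_metric(hypothesis, reference)
    discourse = discourse_similarity(hypothesis, reference)
    return alpha * lexical + (1.0 - alpha) * discourse
```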

human judgments

Appears in 16 sentences as: Human Judgements (1) human judgments (15)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. Here we suggest some simple ways to create such metrics, and we also show that they yield better correlation with human judgments.
    Page 2, “Related Work”
  2. However, they could not improve correlation with human judgments, as evaluated on the MetricsMATR dataset.
    Page 2, “Related Work”
  3. Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments.
    Page 3, “Related Work”
  4. For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
    Page 3, “Related Work”
  5. We measured the correlation of the metrics with the human judgments provided by the organizers.
    Page 4, “Experimental Setup”
  6. 4.2 Human Judgements and Learning
    Page 6, “Experimental Setup”
  7. As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e.
    Page 6, “Experimental Setup”
  8. Unlike PRO, (i) we use human judgments, not automatic scores, and (ii) we train on all pairs, not on a subsample.
    Page 6, “Experimental Setup”
  9. Spearman’s correlation with human judgments.
    Page 7, “Experimental Results”
  10. Overall, we observe an average improvement of +.024 in the correlation with the human judgments.
    Page 7, “Experimental Results”
  11. Kendall’s Tau with human judgments.
    Page 7, “Experimental Results”
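
As the excerpts note, the paper reports Spearman's correlation at the system level and Kendall's Tau at the segment level. The sketch below, assuming parallel lists of metric scores and human scores, shows how such correlations can be computed with scipy; note that the official WMT segment-level Tau is derived from pairwise ranking judgments, so this is a simplification.

```python
# Sketch: correlating metric scores with human judgments.
# `metric_scores` and `human_scores` are hypothetical parallel lists with one
# value per system (system-level) or per segment (segment-level).

from scipy.stats import kendalltau, spearmanr

def system_level_correlation(metric_scores, human_scores):
    """Spearman's rank correlation, as reported for system-level evaluation."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho

def segment_level_correlation(metric_scores, human_scores):
    """Kendall's Tau; a simplification of the WMT pairwise-ranking procedure."""
    tau, _ = kendalltau(metric_scores, human_scores)
    return tau
```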

BLEU

Appears in 12 sentences as: BLEU (12)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. A common argument is that current automatic evaluation metrics such as BLEU are inadequate to capture discourse-related aspects of translation quality (Hardmeier and Federico, 2010; Meyer et al., 2012).
    Page 2, “Related Work”
  2. For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
    Page 3, “Related Work”
  3. To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
    Page 5, “Experimental Setup”
  4. Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
    Page 5, “Experimental Setup”
  5. Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
    Page 6, “Experimental Results”
  6. (Table excerpt, metric group II) TER: .812, .836, .848; BLEU: .810, .830, .846
    Page 7, “Experimental Results”
  7. We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively).
    Page 7, “Experimental Results”
  8. (Table excerpt, metric group III) TER: .217, .179, .229; BLEU: .185, .154, .190
    Page 7, “Experimental Results”
  9. (Table excerpt) BLEU: .185, —, .189, .194
    Page 8, “Experimental Results”
  10. Since the metrics that participated in WMT11 and WMT12 are different (and even when they have the same name, there is no guarantee that they have not changed from 2011 to 2012), we only report results for the versions of NIST, ROUGE, TER, and BLEU available in ASIYA, as well as for the ASIYA metrics, thus ensuring that the metrics in the experiments are consistent for 2011 and 2012.
    Page 9, “Experimental Results”
  11. On the contrary, DR and DR-LEX significantly improve over NIST, ROUGE, TER, and BLEU.
    Page 9, “Experimental Results”

sentence-level

Appears in 12 sentences as: sentence-level (13)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. From its foundations, Statistical Machine Translation (SMT) had two defining characteristics: first, translation was modeled as a generative process at the sentence-level.
    Page 1, “Introduction”
  2. Recently, there have been two promising research directions for improving SMT and its evaluation: (a) by using more structured linguistic information, such as syntax (Galley et al., 2004; Quirk et al., 2005), hierarchical structures (Chiang, 2005), and semantic roles (Wu and Fung, 2009; Lo et al., 2012), and (b) by going beyond the sentence-level, e.g., translating at the document level (Hardmeier et al., 2012).
    Page 1, “Introduction”
  3. Going beyond the sentence-level is important since sentences rarely stand on their own in a well-written text.
    Page 1, “Introduction”
  4. These metrics tasks are based on sentence-level evaluation, which arguably can limit the benefits of using global discourse properties.
    Page 2, “Introduction”
  5. Furthermore, sentence-level scoring (i) is compatible with most translation systems, which work on a sentence-by-sentence basis, (ii) could be beneficial to modern MT tuning mechanisms such as PRO (Hopkins and May, 2011) and MIRA (Watanabe et al., 2007; Chiang et al., 2008), which also work at the sentence-level, and (iii) could be used for re-ranking n-best lists of translation hypotheses.
    Page 2, “Introduction”
  6. Unlike their work, which measures lexical cohesion at the document-level, here we are concerned with coherence (rhetorical) structure, primarily at the sentence-level.
    Page 3, “Related Work”
  7. As in the WMT12 experimental setup, we use these rankings to calculate correlation with human judgments at the sentence-level, i.e.
    Page 6, “Experimental Setup”
  8. We speculate that this might be caused by the fact that the lexical information in DR-LEX is incorporated only in the form of unigram matching at the sentence-level, while the metrics in group IV are already complex combined metrics, which take into account stronger lexical models.
    Page 7, “Experimental Results”
  9. This is remarkable given that DR has a strong negative Tau as an individual metric at the sentence-level.
    Page 9, “Experimental Results”
  10. Our results show that discourse-based metrics can improve the state-of-the-art MT metrics, by increasing correlation with human judgments, even when only sentence-level discourse information is used.
    Page 9, “Conclusions and Future Work”
  11. First, at the sentence-level, we can use discourse information to re-rank alternative MT hypotheses; this could be applied either for MT parameter tuning, or as a postprocessing step for the MT output.
    Page 9, “Conclusions and Future Work”
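
Item 11 above suggests using sentence-level discourse information to re-rank alternative MT hypotheses. A small illustrative sketch follows, assuming each hypothesis comes with a decoder score; the discourse scoring function and the interpolation weight beta are assumptions, not part of the paper.

```python
# Illustrative sketch: re-ranking an n-best list of translation hypotheses by
# adding a discourse-based bonus to the decoder score. `discourse_score` and
# the weight `beta` are hypothetical placeholders.

def rerank_nbest(nbest, discourse_score, beta=0.1):
    """Return (hypothesis, combined_score) pairs sorted best-first.

    `nbest` is a list of (hypothesis, decoder_score) pairs.
    """
    rescored = [(hyp, score + beta * discourse_score(hyp)) for hyp, score in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```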

TER

Appears in 11 sentences as: TER (11)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. For BLEU and TER, they observed improved correlation with human judgments on the MTC4 dataset when linearly interpolating these metrics with their lexical cohesion score.
    Page 3, “Related Work”
  2. From the original ULC, we only replaced TER and Meteor individual metrics by newer versions taking into account synonymy lookup and paraphrasing: TERp-A and METEOR-pa in ASIYA’s terminology.
    Page 5, “Experimental Setup”
  3. To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
    Page 5, “Experimental Setup”
  4. Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
    Page 6, “Experimental Results”
  5. (Table excerpt, metric group II) TER: .812, .836, .848; BLEU: .810, .830, .846
    Page 7, “Experimental Results”
  6. We can see that DR is already competitive by itself: on average, it has a correlation of .807, very close to BLEU and TER scores (.810 and .812, respectively).
    Page 7, “Experimental Results”
  7. (Table excerpt, metric group III) TER: .217, .179, .229; BLEU: .185, .154, .190
    Page 7, “Experimental Results”
  8. (Table excerpt, metric group III) ROUGE: .185, —, .196, .218; TER: .217, —, .229, .246
    Page 8, “Experimental Results”
  9. Since the metrics that participated in WMT11 and WMT12 are different (and even when they have the same name, there is no guarantee that they have not changed from 2011 to 2012), we only report results for the versions of NIST, ROUGE, TER, and BLEU available in ASIYA, as well as for the ASIYA metrics, thus ensuring that the metrics in the experiments are consistent for 2011 and 2012.
    Page 9, “Experimental Results”
  10. On the contrary, DR and DR-LEX significantly improve over NIST, ROUGE, TER, and BLEU.
    Page 9, “Experimental Results”
  11. (Table excerpt, metric group III) ROUGE: .205, —, .218, .242; TER: .262, —, .274, .296
    Page 9, “Conclusions and Future Work”

discourse parser

Appears in 10 sentences as: discourse parse (2) discourse parser (4) discourse parsers (1) discourse parses (1) discourse parsing (2)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We first design two discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory.
    Page 1, “Abstract”
  2. One possible reason could be the unavailability of accurate discourse parsers.
    Page 1, “Introduction”
  3. We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
    Page 2, “Introduction”
  4. Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments.
    Page 3, “Related Work”
  5. In order to develop a discourse-aware evaluation metric, we first generate discourse trees for the reference and the system-translated sentences using a discourse parser, and then we measure the similarity between the two discourse trees.
    Page 3, “Our Discourse-Based Measures”
  6. In Rhetorical Structure Theory, discourse analysis involves two subtasks: (i) discourse segmentation, or breaking the text into a sequence of EDUs, and (ii) discourse parsing, or the task of linking the units (EDUs and larger discourse units) into labeled discourse trees.
    Page 3, “Our Discourse-Based Measures”
  7. (2012) proposed discriminative models for both discourse segmentation and discourse parsing at the sentence level.
    Page 3, “Our Discourse-Based Measures”
  8. The discourse parser uses a dynamic Conditional Random Field (Sutton et al., 2007) as a parsing model in order to infer the probability of all possible discourse tree constituents.
    Page 3, “Our Discourse-Based Measures”
  9. (Footnote 2) The discourse parser is freely available from http://alt.qcri.org/tools/
    Page 3, “Our Discourse-Based Measures”
  10. First, we defined two simple discourse-aware similarity metrics (lexicalized and un-lexicalized), which use the all-subtree kernel to compute similarity between discourse parse trees in accordance with the Rhetorical Structure Theory.
    Page 9, “Conclusions and Future Work”
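
The excerpts above describe RST-style discourse trees: EDUs as leaves, and internal nodes labeled with a rhetorical relation and the nuclearity of their children. A hedged sketch of such a structure follows; the field names and the example are assumptions, not the parser's actual output format.

```python
# Sketch of a simple RST discourse-tree representation: leaves carry EDU text,
# internal nodes carry a rhetorical relation, and every child records its
# nuclearity (Nucleus or Satellite). Field names are illustrative only.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DiscourseNode:
    relation: Optional[str] = None      # e.g., "Elaboration", "Attribution"
    nuclearity: Optional[str] = None    # "Nucleus" or "Satellite"
    edu_text: Optional[str] = None      # set only for leaf EDUs
    children: List["DiscourseNode"] = field(default_factory=list)

    def is_leaf(self):
        return not self.children

# A two-EDU example sentence, segmented and linked by a Concession relation:
tree = DiscourseNode(relation="Concession", children=[
    DiscourseNode(nuclearity="Nucleus", edu_text="The book is good,"),
    DiscourseNode(nuclearity="Satellite", edu_text="although it is long."),
])
```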

language pairs

Appears in 10 sentences as: language pair (2) language pairs (8)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. Compared to the previous work, (i) we use a different discourse representation (RST), (ii) we compare discourse parses using all-subtree kernels (Collins and Duffy, 2001), (iii) we evaluate on much larger datasets, for several language pairs and for multiple metrics, and (iv) we do demonstrate better correlation with human judgments.
    Page 3, “Related Work”
  2. In our experiments, we used the data available for the WMT12 and the WMT11 metrics shared tasks for translations into English. This included the output from the systems that participated in the WMT12 and the WMT11 MT evaluation campaigns, both consisting of 3,003 sentences, for four different language pairs: Czech-English (CS-EN), French-English (FR-EN), German-English (DE-EN), and Spanish-English (ES-EN); as well as a dataset with the English references.
    Page 4, “Experimental Setup”
  3. Table 1: Number of systems (systs), judgments (ranks), unique sentences (sents), and different judges (judges) for the different language pairs, for the human evaluation of the WMT12 and WMT11 shared tasks.
    Page 5, “Experimental Setup”
  4. In order to make the scores of the different metrics comparable, we performed a min-max normalization, for each metric, and for each language pair combination.
    Page 6, “Experimental Setup”
  5. We only present the average results over all four language pairs.
    Page 6, “Experimental Results”
  6. Group II: includes the metrics that participated in the WMT12 metrics task, excluding metrics which did not have results for all language pairs.
    Page 6, “Experimental Results”
  7. Note that, even though DR-LEX has better individual performance than DR, it does not yield improvements when combined with most of the metrics in group IV. However, over all metrics and all language pairs, DR-LEX is able to obtain an average improvement in correlation of +.
    Page 7, “Experimental Results”
  8. We aggregated the data for different language pairs, and produced a single set of tuning weights for all language pairs. We then used the remaining fold for evaluation.
    Page 8, “Experimental Results”
  9. As in previous sections, we present the average results over all four language pairs.
    Page 8, “Experimental Results”
  10. (Footnote 9) Tuning separately for each language pair yielded slightly lower results.
    Page 8, “Experimental Results”
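
Item 4 above mentions min-max normalization per metric and per language pair so that different metrics become comparable before they are combined. A minimal sketch of that step follows; the data layout in the usage comment is an assumption.

```python
# Sketch: min-max normalization of raw metric scores, applied separately for
# each (metric, language pair) combination.

def min_max_normalize(scores):
    """Map a list of raw scores onto [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical usage: raw[(metric, lang_pair)] is a list of segment scores.
# normalized = {key: min_max_normalize(vals) for key, vals in raw.items()}
```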

machine translation

Appears in 9 sentences as: Machine Translation (4) machine translation (6)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We present experiments in using discourse structure for improving machine translation evaluation.
    Page 1, “Abstract”
  2. Then, we show that these measures can help improve a number of existing machine translation evaluation metrics both at the segment- and at the system-level.
    Page 1, “Abstract”
  3. From its foundations, Statistical Machine Translation (SMT) had two defining characteristics: first, translation was modeled as a generative process at the sentence-level.
    Page 1, “Introduction”
  4. This is demonstrated by the establishment of a recent workshop dedicated to Discourse in Machine Translation (Webber et al., 2013), collocated with the 2013 annual meeting of the Association for Computational Linguistics.
    Page 1, “Introduction”
  5. The area of discourse analysis for SMT is still nascent and, to the best of our knowledge, no previous research has attempted to use rhetorical structure for SMT or machine translation evaluation.
    Page 1, “Introduction”
  6. Addressing discourse-level phenomena in machine translation is relatively new as a research direction.
    Page 2, “Related Work”
  7. The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
    Page 2, “Related Work”
  8. In this section, we explore how discourse information can be used to improve machine translation evaluation metrics.
    Page 6, “Experimental Results”
  9. Overall, from the experimental results in this section, we can conclude that discourse structure is an important information source to be taken into account in the automatic evaluation of machine translation output.
    Page 9, “Experimental Results”

NIST

Appears in 9 sentences as: NIST (9)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
    Page 2, “Related Work”
  2. To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
    Page 5, “Experimental Setup”
  3. Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
    Page 5, “Experimental Setup”
  4. Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
    Page 6, “Experimental Results”
  5. (Table excerpt) NIST: .817, .842, .875
    Page 7, “Experimental Results”
  6. (Table excerpt) NIST: .214, .172, .206; ROUGE: .185, .144, .201
    Page 7, “Experimental Results”
  7. (Table excerpt) [metric name truncated]: .186, —, .181, .196; XENERRCATS: .165, —, .175, .194; POSF: .154, —, .160, .201; WORDBLOCKEC: .153, —, .161, .189; BLOCKERRCATS: .074, —, .087, .150; NIST: .214, —, .222, .224
    Page 8, “Experimental Results”
  8. Since the metrics that participated in WMT11 and WMT12 are different (and even when they have the same name, there is no guarantee that they have not changed from 2011 to 2012), we only report results for the versions of NIST, ROUGE, TER, and BLEU available in ASIYA, as well as for the ASIYA metrics, thus ensuring that the metrics in the experiments are consistent for 2011 and 2012.
    Page 9, “Experimental Results”
  9. On the contrary, DR and DR-LEX significantly improve over NIST, ROUGE, TER, and BLEU.
    Page 9, “Experimental Results”

discourse structure

Appears in 4 sentences as: discourse structure (4)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We present experiments in using discourse structure for improving machine translation evaluation.
    Page 1, “Abstract”
  2. Our experiments show that many existing metrics can benefit from additional knowledge about discourse structure.
    Page 2, “Related Work”
  3. Overall, from the experimental results in this section, we can conclude that discourse structure is an important information source to be taken into account in the automatic evaluation of machine translation output.
    Page 9, “Experimental Results”
  4. In this paper we have shown that discourse structure can be used to improve automatic MT evaluation.
    Page 9, “Conclusions and Future Work”

tree kernel

Appears in 4 sentences as: tree kernel (2) Tree Kernels (1) Tree kernels (1)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. A number of metrics have been proposed to measure the similarity between two labeled trees, e.g., Tree Edit Distance (Tai, 1979) and Tree Kernels (Collins and Duffy, 2001; Moschitti and Basili, 2006).
    Page 3, “Our Discourse-Based Measures”
  2. Tree kernels (TKs) provide an effective way to integrate arbitrary tree structures in kernel-based machine learning algorithms like SVMs.
    Page 3, “Our Discourse-Based Measures”
  3. the nuclearity and the relations, in order to allow the tree kernel to give partial credit to subtrees that differ in labels but match in their skeletons.
    Page 4, “Our Discourse-Based Measures”
  4. In order to allow the tree kernel to find subtree matches at the word level, we include an additional layer of dummy leaves as was done in (Moschitti et al., 2007); not shown in Figure 2, for simplicity.
    Page 4, “Our Discourse-Based Measures”
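
The excerpts above refer to the convolution tree kernel of Collins and Duffy (2001), which counts the subtrees shared by two labeled trees. Below is a minimal, simplified sketch of such a kernel over plain (label, children) tuples, with a decay factor lam that down-weights larger subtrees and a cosine-style normalization; it illustrates the technique but is not the implementation used in the paper, and it omits details such as the dummy word-leaf layer and partial credit for mismatched labels.

```python
# Minimal sketch of a Collins & Duffy (2001)-style convolution tree kernel.
# Trees are (label, [children]) tuples; the kernel value is the decayed count
# of subtrees the two trees have in common.

from math import sqrt

def production(node):
    """A node's 'production': its label plus the ordered labels of its children."""
    label, children = node
    return (label, tuple(child[0] for child in children))

def collect(node):
    """All nodes of a tree, in pre-order."""
    _, children = node
    nodes = [node]
    for child in children:
        nodes.extend(collect(child))
    return nodes

def delta(n1, n2, lam):
    """Weighted count of common subtrees rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    _, c1 = n1
    _, c2 = n2
    if not c1:                       # two matching leaves
        return lam
    score = lam
    for a, b in zip(c1, c2):         # children align because productions match
        score *= 1.0 + delta(a, b, lam)
    return score

def tree_kernel(t1, t2, lam=0.5):
    return sum(delta(a, b, lam) for a in collect(t1) for b in collect(t2))

def similarity(t1, t2, lam=0.5):
    """Normalized kernel, so identical trees score 1.0."""
    return tree_kernel(t1, t2, lam) / sqrt(tree_kernel(t1, t1, lam) *
                                           tree_kernel(t2, t2, lam))
```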

lexicalized

Appears in 3 sentences as: lexicalized (3)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We experiment with TKs applied to two different representations of the discourse tree: non-lexicalized (DR), and lexicalized (DR-LEX).
    Page 4, “Our Discourse-Based Measures”
  2. As expected, DR-LEX performs better than DR since it is lexicalized (at the unigram level), and also gives partial credit to correct structures.
    Page 7, “Experimental Results”
  3. First, we defined two simple discourse-aware similarity metrics (lexicalized and un-lexicalized), which use the all-subtree kernel to compute similarity between discourse parse trees in accordance with the Rhetorical Structure Theory.
    Page 9, “Conclusions and Future Work”
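
The DR and DR-LEX variants above differ in whether words are attached to the discourse tree. The sketch below shows one possible way to build the two representations fed to a tree kernel: DR keeps only the structural labels, while DR-LEX adds an extra layer of word leaves, in the spirit of the dummy-leaf layer mentioned earlier. The nested (label, payload) input format is an assumption, not the parser's output.

```python
# Sketch: building non-lexicalized (DR) and lexicalized (DR-LEX) trees for the
# kernel. Input nodes are (label, payload) pairs where payload is either the
# EDU text (leaf) or a list of child nodes; the format is illustrative.

def to_kernel_tree(node, lexicalized=False):
    label, payload = node
    if isinstance(payload, str):                     # leaf EDU
        if lexicalized:
            # add one dummy leaf per word so subtree matches can reach words
            return (label, [(word, []) for word in payload.split()])
        return (label, [])
    return (label, [to_kernel_tree(child, lexicalized) for child in payload])

# Hypothetical example:
rst_tree = ("Concession", [("Nucleus", "The book is good,"),
                           ("Satellite", "although it is long.")])
dr_tree = to_kernel_tree(rst_tree)                         # DR
dr_lex_tree = to_kernel_tree(rst_tree, lexicalized=True)   # DR-LEX
```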

parse trees

Appears in 3 sentences as: parse trees (3)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We first design two discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory.
    Page 1, “Abstract”
  2. Combination of four metrics based on syntactic information from constituency and dependency parse trees: 'CP-STM-4', 'DP-HWCM_c-4', 'DP-HWCM1-4', and 'DP-Or(*)'.
    Page 5, “Experimental Setup”
  3. First, we defined two simple discourse-aware similarity metrics (lexicalized and un-lexicalized), which use the all-subtree kernel to compute similarity between discourse parse trees in accordance with the Rhetorical Structure Theory.
    Page 9, “Conclusions and Future Work”

shared tasks

Appears in 3 sentences as: shared tasks (3)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. We first design two discourse-aware similarity measures, which use DTs generated by a publicly-available discourse parser (Joty et al., 2012); then, we show that they can help improve a number of MT evaluation metrics at the segment- and at the system-level in the context of the WMT11 and the WMT12 metrics shared tasks (Callison-Burch et al., 2011; Callison-Burch et al., 2012).
    Page 2, “Introduction”
  2. In our experiments, we used the data available for the WMT12 and the WMT11 metrics shared tasks for translations into English. This included the output from the systems that participated in the WMT12 and the WMT11 MT evaluation campaigns, both consisting of 3,003 sentences, for four different language pairs: Czech-English (CS-EN), French-English (FR-EN), German-English (DE-EN), and Spanish-English (ES-EN); as well as a dataset with the English references.
    Page 4, “Experimental Setup”
  3. Table 1: Number of systems (systs), judgments (ranks), unique sentences (sents), and different judges (judges) for the different language pairs, for the human evaluation of the WMT12 and WMT11 shared tasks.
    Page 5, “Experimental Setup”

SMT systems

Appears in 3 sentences as: SMT systems (3)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. Although modern SMT systems have switched to a discriminative log-linear framework, which allows for additional sources as features, it is generally hard to incorporate dependencies beyond a small window of adjacent words, thus making it difficult to use linguistically-rich models.
    Page 1, “Introduction”
  2. We believe that the semantic and pragmatic information captured in the form of DTs (i) can help develop discourse-aware SMT systems that produce coherent translations, and (ii) can yield better MT evaluation metrics.
    Page 2, “Introduction”
  3. While in this work we focus on the latter, we think that the former is also within reach, and that SMT systems would benefit from preserving the coherence relations in the source language when generating target-language translations.
    Page 2, “Introduction”

subtrees

Appears in 3 sentences as: subtrees (3)
In Using Discourse Structure Improves Machine Translation Evaluation
  1. In the present work, we use the convolution TK defined in (Collins and Duffy, 2001), which efficiently calculates the number of common subtrees in two trees.
    Page 3, “Our Discourse-Based Measures”
  2. Note that this kernel was originally designed for syntactic parsing, where the subtrees are subject to the constraint that their nodes are taken with either all or none of the children.
    Page 3, “Our Discourse-Based Measures”
  3. the nuclearity and the relations, in order to allow the tree kernel to give partial credit to subtrees that differ in labels but match in their skeletons.
    Page 4, “Our Discourse-Based Measures”
