English-to-Russian MT evaluation campaign
Pavel Braslavski, Alexander Beloborodov, Maxim Khalilov, Serge Sharoff

Article Structure

Introduction

Machine Translation (MT) between English and Russian was one of the first translation directions tested at the dawn of MT research in the 1950s (Hutchins, 2000).

Corpus preparation

In designing the set of texts for evaluation, we had two issues in mind.

Evaluation methodology

The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.

Results

We received results from five teams; two teams submitted two runs each, which totals seven participants’ runs (referred to as P1..P7 in the paper).

Conclusions and future plans

This was the first attempt at a proper quantitative and qualitative evaluation of English→Russian MT systems.

Topics

human judgements

Appears in 7 sentences as: human judge (2) human judgement (1) human judgements (3) human judges (1)
In English-to-Russian MT evaluation campaign
  1. The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.
    Page 2, “Evaluation methodology”
  2. This task is also much simpler for human judges to complete.
    Page 2, “Evaluation methodology”
  3. The idea is to run a standard sort algorithm and ask a human judge each time a comparison operation is required.
    Page 3, “Evaluation methodology” (a sketch of this sort-with-a-judge procedure follows this list)
  4. For example, if it favours one system against another, while in human judgement they are equal, the final ranking will preserve the initial order.
    Page 3, “Evaluation methodology”
  5. …correlations of these metrics with human judgements for the English→Russian pair on the corpus level and on the level of individual sentences.
    Page 3, “Evaluation methodology”
  6. METEOR (with its built-in Russian lemmatisation) and GTM offer the best correlation with human judgements.
    Page 4, “Results”
  7. Table 3: Correlation to human judgements
    Page 5, “Results”
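
Item 3 above describes the dynamic ranking procedure: a standard comparison sort is run over the competing systems, and each time the sort needs to compare two of them, a human judge decides which translation is better. Below is a minimal Python sketch of that idea; the ask_judge() helper, the simulated preferences, and the initial ordering are illustrative assumptions, not the campaign's actual implementation.

from functools import cmp_to_key

def ask_judge(system_a, system_b):
    """Stand-in for a human judgement: -1 if A's translation is better,
    1 if B's is better, 0 if the judge considers them equal."""
    simulated = {("P1", "P2"): -1, ("P2", "P4"): 0, ("P1", "P4"): -1}
    if (system_a, system_b) in simulated:
        return simulated[(system_a, system_b)]
    return -simulated.get((system_b, system_a), 0)

def dynamic_ranking(initial_order):
    # Python's sort is stable, so systems the judge calls equal keep their
    # initial relative order -- the behaviour described in item 4 above when
    # the initial order comes from an automatic metric such as NIST.
    return sorted(initial_order, key=cmp_to_key(ask_judge))

print(dynamic_ranking(["P2", "P1", "P4"]))   # -> ['P1', 'P2', 'P4']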

MT systems

Appears in 5 sentences as: MT systems (5)
In English-to-Russian MT evaluation campaign
  1. One of the main challenges in developing MT systems for Russian and for evaluating them is the need to deal with its free word order and complex morphology.
    Page 1, “Introduction”
  2. We chose to retain the entire texts in the corpus rather than individual sentences, since some MT systems may use information beyond isolated sentences.
    Page 2, “Corpus preparation”
  3. In our case the assessors were asked to make a pairwise comparison of two sentences translated by two different MT systems against a gold standard translation.
    Page 2, “Evaluation methodology” (a sketch of aggregating such pairwise judgements follows this list)
  4. This was the first attempt at a proper quantitative and qualitative evaluation of English→Russian MT systems.
    Page 5, “Conclusions and future plans”
  5. We have made the corpus comprising the source sentences, their human translations, translations by participating MT systems and the human evaluation data publicly available.
    Page 5, “Conclusions and future plans”
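
Item 3 above describes the manual protocol: assessors compare, pairwise, two systems' translations of the same sentence against a gold-standard translation. A minimal sketch of aggregating such pairwise judgements into per-system win rates follows; the judgement records and the half-win treatment of ties are illustrative assumptions, not the campaign's actual aggregation scheme.

from collections import defaultdict

# Each record: (system_a, system_b, winner), winner being "a", "b", or "tie".
judgements = [
    ("P1", "P2", "a"),
    ("P1", "P4", "tie"),
    ("P2", "P4", "b"),
]

wins = defaultdict(float)
comparisons = defaultdict(int)
for sys_a, sys_b, winner in judgements:
    comparisons[sys_a] += 1
    comparisons[sys_b] += 1
    if winner == "a":
        wins[sys_a] += 1
    elif winner == "b":
        wins[sys_b] += 1
    else:                       # a tie gives half a win to each side
        wins[sys_a] += 0.5
        wins[sys_b] += 0.5

for system in sorted(comparisons, key=lambda s: wins[s] / comparisons[s], reverse=True):
    print(f"{system}: win rate {wins[system] / comparisons[system]:.2f}")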

word order

Appears in 4 sentences as: word order (4)
In English-to-Russian MT evaluation campaign
  1. One of the main challenges in developing MT systems for Russian and for evaluating them is the need to deal with its free word order and complex morphology.
    Page 1, “Introduction”
  2. While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness on the sentence level make TER much less reliable for Russian.
    Page 4, “Results”
  3. We will also address the problem of tailoring automatic evaluation measures to Russian — accounting for complex morphology and free word order.
    Page 5, “Conclusions and future plans”
  4. While the campaign was based exclusively on data in one language direction, the correlation results for automatic MT quality measures should be applicable to other languages with free word order and complex morphology.
    Page 5, “Conclusions and future plans”

BLEU

Appears in 3 sentences as: BLEU (3)
In English-to-Russian MT evaluation campaign
  1. In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
    Page 3, “Evaluation methodology” (a hedged scoring example follows this list)
  2. OS1 usually has the highest overall score (except BLEU); it also has the highest scores for ‘regulations’ (more formal texts), while P1 scores are better for the news documents.
    Page 4, “Results”
  3. Table 3 (correlation to human judgements):
       Metric    Sentence level (median / mean / trimmed)    Corpus level
       BLEU      0.357 / 0.298 / 0.348                        0.833
       NIST      0.357 / 0.291 / 0.347                        0.810
       METEOR    0.429 / 0.348 / 0.393                        0.714
       TER       0.214 / 0.186 / 0.204                        0.619
       GTM       0.429 / 0.340 / 0.392                        0.714
    Page 5, “Results”
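
Item 1 above lists the automatic measures used at the corpus level. As a hedged illustration of how two of them can be computed today, the sketch below uses the sacrebleu library for BLEU and TER; this is not the campaign's original tooling (NIST, METEOR and GTM each have their own reference implementations), and the Russian sentences are invented placeholders, not campaign data.

import sacrebleu

# Placeholder system outputs and one stream of reference translations.
hypotheses = [
    "пример перевода системы",
    "ещё одно предложение перевода",
]
references = [[
    "пример эталонного перевода",
    "ещё одно эталонное предложение",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}   TER: {ter.score:.1f}")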

manual evaluation

Appears in 3 sentences as: manual evaluation (4)
In English-to-Russian MT evaluation campaign
  1. For manual evaluation, we randomly selected 330 sentences out of 947 used for automatic evaluation, specifically, 190 from the ‘news’ part and 140 from the ‘regulations’ part.
    Page 2, “Corpus preparation” (a sampling sketch follows this list)
  2. The main idea of manual evaluation was (1) to make the assessment as simple as possible for a human judge and (2) to make the results of evaluation unambiguous.
    Page 2, “Evaluation methodology”
  3. Automatic evaluation measures were calculated for 11 runs; eight runs underwent manual evaluation (four online systems plus four participants’ runs; by agreement with the participants, no manual evaluation was done for runs P3, P6, and P7 to reduce the workload).
    Page 3, “Results”
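
Item 1 above describes how the manual-evaluation subset was drawn: 330 of the 947 automatically evaluated sentences, 190 from the ‘news’ part and 140 from the ‘regulations’ part. A minimal sketch of such a stratified random draw follows; the placeholder sentence lists, their 600/347 split, and the fixed seed are illustrative assumptions (only the totals 947 and 330 come from the paper).

import random

news = [f"news sentence {i}" for i in range(600)]               # placeholder data
regulations = [f"regulation sentence {i}" for i in range(347)]  # placeholder data

random.seed(0)  # fixed only to make the sketch reproducible
manual_subset = random.sample(news, 190) + random.sample(regulations, 140)
assert len(manual_subset) == 330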

NIST

Appears in 3 sentences as: NIST (3)
In English-to-Russian MT evaluation campaign
  1. In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
    Page 3, “Evaluation methodology”
  2. The lower part of Table 2 also reports the results of simulated dynamic ranking (using the NIST rankings as the initial order for the sort operation).
    Page 4, “Results”
  3. Table 3 (correlation to human judgements):
       Metric    Sentence level (median / mean / trimmed)    Corpus level
       BLEU      0.357 / 0.298 / 0.348                        0.833
       NIST      0.357 / 0.291 / 0.347                        0.810
       METEOR    0.429 / 0.348 / 0.393                        0.714
       TER       0.214 / 0.186 / 0.204                        0.619
       GTM       0.429 / 0.340 / 0.392                        0.714
    Page 5, “Results”

Phrase-based

Appears in 3 sentences as: Phrase-based (2) phrase-based (1)
In English-to-Russian MT evaluation campaign
  1. Long-distance dependencies are common, and this creates problems for both RBMT and SMT systems (especially for phrase-based ones).
    Page 1, “Introduction”
  2. OS1 Phrase-based SMT
    Page 3, “Results”
  3. OS2 Phrase-based SMT
    Page 3, “Results”

TER

Appears in 3 sentences as: TER (4)
In English-to-Russian MT evaluation campaign
  1. In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
    Page 3, “Evaluation methodology”
  2. While TER and GTM are known to provide better correlation with post-editing efforts for English (O’Brien, 2011), free word order and greater data sparseness on the sentence level make TER much less reliable for Russian.
    Page 4, “Results”
  3. Table 3 (correlation to human judgements):
       Metric    Sentence level (median / mean / trimmed)    Corpus level
       BLEU      0.357 / 0.298 / 0.348                        0.833
       NIST      0.357 / 0.291 / 0.347                        0.810
       METEOR    0.429 / 0.348 / 0.393                        0.714
       TER       0.214 / 0.186 / 0.204                        0.619
       GTM       0.429 / 0.340 / 0.392                        0.714
    Page 5, “Results”
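
Item 3 reproduces Table 3, which reports how well each automatic metric correlates with the human judgements at the sentence and corpus levels. A minimal sketch of computing such a figure at the corpus level follows; the use of Spearman's rank correlation and all numbers are illustrative assumptions, not the paper's actual statistic or data.

from scipy.stats import spearmanr

# Per-system scores for a handful of hypothetical systems.
metric_scores = [0.21, 0.18, 0.25, 0.19, 0.23]   # e.g. a metric score per system
human_scores = [0.55, 0.40, 0.70, 0.45, 0.60]    # e.g. a pairwise win rate per system

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")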
