Automatic Evaluation Method for Machine Translation Using Noun-Phrase Chunking
Echizen-ya, Hiroshi and Araki, Kenji

Article Structure

Abstract

As described in this paper, we propose a new automatic evaluation method for machine translation using noun-phrase chunking.

Introduction

High-quality automatic evaluation has become increasingly important as various machine translation systems have been developed.

Automatic Evaluation Method using Noun-Phrase Chunking

The system based on our method comprises four processes.

Experiments

3.1 Experimental Procedure

Conclusion

As described herein, we proposed a new automatic evaluation method for machine translation using noun-phrase chunking.

Topics

noun phrases

Appears in 40 sentences as: Noun Phrases (1) noun phrases (44)
  1. Our method correctly determines the matching words between two sentences using corresponding noun phrases.
    Page 1, “Abstract”
  2. Using noun phrases produced by chunking, our method yields the correct word correspondences and determines the similarity between two sentences in terms of the noun phrase order of appearance.
    Page 1, “Introduction”
  3. Firstly, the system determines the correspondences of noun phrases between MT outputs and references using chunking.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  4. Secondly, the system calculates word-level scores based on the correct matched words using the determined correspondences of noun phrases.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  5. 2.1 Correspondence of Noun Phrases by Chunking
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  6. The system obtains the noun phrases from each sentence by chunking.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  7. It then determines corresponding noun phrases between MT outputs and references by calculating the similarity between two noun phrases using the PER score (Su et al., 1992) (see the sketch after this list).
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  8. A high score indicates that the two noun phrases are highly similar.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  9. Figure 1 presents an example of the determination of the corresponding noun phrases.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  10. (2) Determination of corresponding noun phrases. MT output: in general, [NP the amount] of [NP the crowning fall]
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  11. Figure 1: Example of determination of corresponding noun phrases.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
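
The excerpts above do not reproduce the paper's matching procedure, so the following Python sketch illustrates one plausible reading of this step: each pair of noun phrases is scored with a PER-style bag-of-words similarity, and phrases are then paired greedily by best score. The similarity formula (a 1 - PER form) and the greedy pairing are illustrative assumptions, not the paper's algorithm.

```python
from collections import Counter

def per_similarity(phrase_a, phrase_b):
    """PER-style bag-of-words similarity (an assumed 1 - PER form)."""
    a, b = phrase_a.split(), phrase_b.split()
    if not b:
        return 0.0
    matches = sum((Counter(a) & Counter(b)).values())
    errors = (len(b) - matches) + max(0, len(a) - len(b))
    return max(0.0, 1.0 - errors / len(b))

def correspond_noun_phrases(mt_phrases, ref_phrases):
    """Greedily pair each MT noun phrase with its most similar,
    still-unused reference noun phrase; higher score = more similar."""
    pairs, unused = [], list(ref_phrases)
    for mt_np in mt_phrases:
        if not unused:
            break
        best = max(unused, key=lambda ref_np: per_similarity(mt_np, ref_np))
        if per_similarity(mt_np, best) > 0.0:
            pairs.append((mt_np, best))
            unused.remove(best)
    return pairs

# Noun phrases as a chunker would bracket them (cf. the [NP ...] marks
# in Figure 1); the reference phrases here are invented for illustration.
mt_nps = ["the amount", "the crowning fall"]
ref_nps = ["the quantity", "the crowning fall"]
print(correspond_noun_phrases(mt_nps, ref_nps))
# [('the amount', 'the quantity'), ('the crowning fall', 'the crowning fall')]
```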


machine translation

Appears in 13 sentences as: machine translation (14)
  1. As described in this paper, we propose a new automatic evaluation method for machine translation using noun-phrase chunking.
    Page 1, “Abstract”
  2. Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced by automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
    Page 1, “Abstract”
  3. High-quality automatic evaluation has become increasingly important as various machine translation systems have been developed.
    Page 1, “Introduction”
  4. Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
    Page 1, “Introduction”
  5. Results confirmed that our method using noun-phrase chunking is effective for automatic evaluation for machine translation .
    Page 1, “Introduction”
  6. These English output sentences are the translations of 100 Japanese sentences produced by the 12 machine translation systems in NTCIR-7.
    Page 5, “Experiments”
  7. Table 1 presents types of the 12 machine translation systems.
    Page 5, “Experiments”
  8. “Avg.” denotes the average of the correlation coefficients of the 12 machine translation systems for the respective automatic evaluation methods, and “All” denotes the correlation coefficients computed using the scores of all 1,200 output sentences obtained using the 12 machine translation systems.
    Page 6, “Experiments”
  9. Moreover, we investigated the correlation for every type of machine translation system.
    Page 6, “Experiments”
  10. translation systems in SMT and the scores of 200 output sentences obtained by 2 machine translation systems in RBMT are used, respectively.
    Page 7, “Experiments”
  11. The correlation coefficients for BLEU with our method are higher than those of BLEU for every machine translation system, “Avg.”, and “All” in Tables 2-5.
    Page 7, “Experiments”


sentence-level

Appears in 11 sentences as: sentence-level (11)
  1. Experimental results show that our method obtained the highest correlations among the methods in both sentence-level adequacy and fluency.
    Page 1, “Abstract”
  2. However, sentence-level automatic evaluation is insufficient.
    Page 1, “Introduction”
  3. As described herein, for use with MT systems, we propose a new automatic evaluation method using noun-phrase chunking to obtain higher sentence-level correlations.
    Page 1, “Introduction”
  4. Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
    Page 1, “Introduction”
  5. We calculated Pearson’s correlation coefficient and Spearman’s rank correlation coefficient between the scores obtained using our method and the scores from human judgments in terms of sentence-level adequacy and fluency.
    Page 5, “Experiments”
  6. Tables 2 and 3 respectively show Pearson’s correlation coefficient for sentence-level adequacy and fluency.
    Page 6, “Experiments”
  7. Tables 4 and 5 respectively show Spearman’s rank correlation coefficient for sentence-level adequacy and fluency.
    Page 6, “Experiments”
  8. Especially in terms of sentence-level adequacy, shown in Tables 2 and 4, “Avg.” of our method is about 0.03 higher than that of IMPACT.
    Page 6, “Experiments”
  9. Moreover, for sentence-level adequacy, BLEU with our method is significantly better than BLEU in almost all machine translation systems and “All” in Tables 2 and 4.
    Page 7, “Experiments”
  10. Experimental results demonstrate that our method yields the highest correlation among eight methods in terms of sentence-level adequacy and fluency.
    Page 9, “Conclusion”
  11. Future studies will improve our method, enabling it to achieve high correlation in sentence-level fluency.
    Page 9, “Conclusion”


translation systems

Appears in 11 sentences as: translation system (2) translation systems (11)
  1. Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced by automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
    Page 1, “Abstract”
  2. High-quality automatic evaluation has become increasingly important as various machine translation systems have been developed.
    Page 1, “Introduction”
  3. Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
    Page 1, “Introduction”
  4. These English output sentences are the translations of 100 Japanese sentences produced by the 12 machine translation systems in NTCIR-7.
    Page 5, “Experiments”
  5. Table 1 presents types of the 12 machine translation systems.
    Page 5, “Experiments”
  6. “Avg.” denotes the average of the correlation coefficients of the 12 machine translation systems for the respective automatic evaluation methods, and “All” denotes the correlation coefficients computed using the scores of all 1,200 output sentences obtained using the 12 machine translation systems.
    Page 6, “Experiments”
  7. Moreover, we investigated the correlation for every type of machine translation system.
    Page 6, “Experiments”
  8. translation systems in SMT and the scores of 200 output sentences obtained by 2 machine translation systems in RBMT are used, respectively.
    Page 7, “Experiments”
  9. The correlation coefficients for BLEU with our method are higher than those of BLEU for every machine translation system, “Avg.”, and “All” in Tables 2-5.
    Page 7, “Experiments”
  10. Moreover, for sentence-level adequacy, BLEU with our method is significantly better than BLEU in almost all machine translation systems and “All” in Tables 2 and 4.
    Page 7, “Experiments”
  11. These results indicate that our method using noun-phrase chunking is effective for some methods and that it is statistically significant for each machine translation system, not only “All”, which contains a large number of sentences.
    Page 7, “Experiments”


human judgments

Appears in 8 sentences as: human judges (2) human judgment (2) human judgments (4)
  1. Evaluation experiments were conducted to calculate the correlation between human judgments and the scores produced by automatic evaluation methods for MT outputs obtained from the 12 machine translation systems in NTCIR-7.
    Page 1, “Abstract”
  2. The scores of some automatic evaluation methods can obtain high correlation with human judgment in document-level automatic evaluation (Coughlin, 2007).
    Page 1, “Introduction”
  3. Evaluation experiments using MT outputs obtained by 12 machine translation systems in NTCIR-7 (Fujii et al., 2008) demonstrate that the scores obtained using our system yield the highest correlation with the human judgments among the automatic evaluation methods in both sentence-level adequacy and fluency.
    Page 1, “Introduction”
  4. We calculated the correlation between the scores obtained using our method and scores produced by human judgment.
    Page 5, “Experiments”
  5. Moreover, three human judges evaluated 1,200 English output sentences from the perspective of adequacy and fluency on a scale of 1-5.
    Page 5, “Experiments”
  6. We used the median value of the evaluation results of the three human judges as the final scores of 1-5.
    Page 5, “Experiments”
  7. We calculated Pearson’s correlation coefficient and Spearman’s rank correlation coefficient between the scores obtained using our method and the scores from human judgments in terms of sentence-level adequacy and fluency (see the sketch after this list).
    Page 5, “Experiments”
  8. Additionally, we calculated the correlations between the scores using seven other methods and the scores from human judgments to compare our method with other automatic evaluation methods.
    Page 5, “Experiments”
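
A minimal sketch of the protocol in items 5-7: take the median of the three judges' 1-5 ratings per sentence, then compute Pearson's and Spearman's correlations against the automatic scores. It relies on SciPy; every number below is an invented placeholder, not the paper's data.

```python
import statistics
from scipy.stats import pearsonr, spearmanr

# One adequacy rating per judge (scale 1-5) for each output sentence.
judge_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
]
human_scores = [statistics.median(r) for r in judge_ratings]

# Hypothetical scores from the automatic evaluation method.
method_scores = [0.71, 0.38, 0.90, 0.15]

print("Pearson: ", pearsonr(method_scores, human_scores)[0])
print("Spearman:", spearmanr(method_scores, human_scores)[0])
```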


word-level

Appears in 8 sentences as: Word-level (1) word-level (7)
  1. Secondly, the system calculates word-level scores based on the correct matched words using the determined correspondences of noun phrases.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  2. The system calculates the final scores by combining word-level scores and phrase-level scores.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  3. 2.2 Word-level Score
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  4. The system calculates the word-level scores between MT output and reference using the corresponding noun phrases.
    Page 2, “Automatic Evaluation Method using Noun-Phrase Chunking”
  5. Moreover, the word-level score is calculated using the common parts in the selected LCS route as the following Eqs.
    Page 3, “Automatic Evaluation Method using Noun-Phrase Chunking”
  6. In this case, the weight of each common word in the common part is 1. The system calculates score_wd as the word-level score in Eq.
    Page 3, “Automatic Evaluation Method using Noun-Phrase Chunking”
  7. The system calculates the final score by combining the word-level score and the phrase-level score as shown in the following Eq. (see the sketch after this list).
    Page 5, “Automatic Evaluation Method using Noun-Phrase Chunking”
  8. The system can realize high-quality automatic evaluation using both word-level information and phrase-level information.
    Page 5, “Automatic Evaluation Method using Noun-Phrase Chunking”
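
The paper's equations are not reproduced in these excerpts, so the Python sketch below only illustrates the shape of this step: a word-level score computed from the longest common subsequence (LCS) between MT output and reference, combined with a phrase-level score into a final score. The F-measure form of word_level_score and the weighted sum in final_score are assumptions, not the paper's Eqs.

```python
def lcs_length(a, b):
    """Dynamic-programming longest-common-subsequence length over words."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def word_level_score(mt, ref):
    """score_wd as an LCS F-measure (assumed form); mt/ref are word lists."""
    lcs = lcs_length(mt, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(mt), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def final_score(score_wd, score_ph, alpha=0.5):
    """Combine word-level and phrase-level scores; the equal weighting
    is hypothetical, not the paper's equation."""
    return alpha * score_wd + (1 - alpha) * score_ph

mt = "in general the amount of the crowning fall".split()
ref = "in general the quantity of the crowning fall".split()
# score_ph would come from the noun-phrase order of appearance; a
# placeholder value is used here.
print(final_score(word_level_score(mt, ref), score_ph=0.8))
```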


BLEU

Appears in 7 sentences as: BLEU (9) “BLEU (2)
  1. Methods based on word strings (e.g., BLEU (Papineni et al., 2002), NIST (NIST, 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004),
    Page 1, “Introduction”
  2. To confirm the effectiveness of noun-phrase chunking, we performed the experiment using a system combining BLEU with our method.
    Page 7, “Experiments”
  3. In this case, BLEU scores were used as score_wd in Eq. (see the sketch after this list).
    Page 7, “Experiments”
  4. This experimental result is shown as “BLEU with our method” in Tables 2-5.
    Page 7, “Experiments”
  5. In the results of “BLEU with our method” in Tables 2-5, underlining signifies that the differences between correlation coefficients obtained using BLEU with our method and BLEU alone are statistically significant at the 5% significance level.
    Page 7, “Experiments”
  6. The correlation coefficients for BLEU with our method are higher than those of BLEU for every machine translation system, “Avg.”, and “All” in Tables 2-5.
    Page 7, “Experiments”
  7. Moreover, for sentence-level adequacy, BLEU with our method is significantly better than BLEU in almost all machine translation systems and “All” in Tables 2 and 4.
    Page 7, “Experiments”
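
A minimal sketch of the substitution in item 3: a sentence-level BLEU score stands in for score_wd and is combined with the phrase-level score. NLTK's sentence_bleu is used with smoothing (an arbitrary choice here, since short sentences often lack higher-order n-gram matches); the equal weighting is hypothetical, as the paper's combining equation is not reproduced in these excerpts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

mt = "in general the amount of the crowning fall".split()
ref = "in general the quantity of the crowning fall".split()

# Sentence-level BLEU substituted for the word-level score score_wd.
score_wd = sentence_bleu([ref], mt,
                         smoothing_function=SmoothingFunction().method1)
score_ph = 0.8  # hypothetical phrase-level (noun-phrase order) score
print(0.5 * score_wd + 0.5 * score_ph)  # hypothetical equal weighting
```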


statistically significant

Appears in 7 sentences as: statistically significant (7)
  1. Moreover, the differences between correlation coefficients obtained using our method and other methods are statistically significant at the 5% or lower significance level for adequacy.
    Page 1, “Introduction”
  2. Underlining in our method signifies that the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level.
    Page 6, “Experiments”
  3. 8 and “All” of Tables 2 and 4, the differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level.
    Page 6, “Experiments”
  4. The differences between correlation coefficients obtained using our method and IMPACT are statistically significant at the 5% significance level for adequacy of SMT.
    Page 7, “Experiments”
  5. In the results of “BLEU with our method” in Tables 2-5, underlining signifies that the differences between correlation coefficients obtained using BLEU with our method and BLEU alone are statistically significant at the 5% significance level.
    Page 7, “Experiments”
  6. These results indicate that our method using noun-phrase chunking is effective for some methods and that it is statistically significant for each machine translation system, not only “All”, which contains a large number of sentences.
    Page 7, “Experiments”
  7. Underlining in “Our method II” of Tables 2-5 signifies that the differences between correlation coefficients obtained using our method II and IMPACT are statistically significant at the 5% significance level (see the sketch after this list).
    Page 8, “Experiments”
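
The excerpts do not say which test produced these significance results, so the sketch below shows one common way to test whether two correlation coefficients differ at the 5% level: the Fisher z-transformation for independent samples. Because the compared methods score the same sentences, a dependent-correlation test (e.g., Steiger's) would arguably be more appropriate; the r values below are invented.

```python
import math
from scipy.stats import norm

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided p-value for H0: r1 == r2, assuming independent samples."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return 2.0 * (1.0 - norm.cdf(abs(z)))

# e.g., our method vs. IMPACT over the 1,200 "All" sentences
# (correlation values invented for illustration):
p = fisher_z_test(0.58, 1200, 0.55, 1200)
print(p, "significant at 5%" if p < 0.05 else "not significant at 5%")
```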
