Analysis and Discussion | Table 4: Results (BLEU%) of the Chinese-to-English large-data (CE_LD) and small-data (CE_SD) NIST tasks when applying one feature; the columns report the ’06 and ’08 NIST test sets for each condition.
Analysis and Discussion | Table 6: Results (BLEU%) of using simple context-based features on the small-data NIST task.
Experiments | The first one is the large data condition, based on training data for the NIST 2009 evaluation Chinese-to-English track.
Experiments | We first created a development set that used mainly data from the NIST 2005 test set, along with some balanced-genre web text from the NIST training material.
Experiments | Evaluation was performed on the NIST 2006 and 2008 test sets. |
Abstract | The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems. |
Background | In this paper, we use the NIST definition of BLEU, where the effective reference length is the length of the shortest reference translation.
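Background | As an illustration of this definition (a minimal sketch, not the paper's code), the Python fragment below contrasts the NIST effective reference length, i.e., the shortest reference, with the closest-length rule of the original IBM BLEU; all function names here are ours.

```python
# Minimal sketch (assumed names, not from the paper): effective reference
# length under the NIST definition of BLEU (shortest reference) versus the
# closest-length rule of the original IBM definition.
import math

def effective_ref_len_nist(hyp_len, ref_lens):
    # NIST definition: length of the shortest reference translation.
    return min(ref_lens)

def effective_ref_len_closest(hyp_len, ref_lens):
    # IBM definition: reference length closest to the hypothesis length,
    # with ties broken toward the shorter reference.
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))

def brevity_penalty(hyp_len, eff_ref_len):
    # BP = 1 if c >= r, else exp(1 - r/c), as in Papineni et al. (2002).
    if hyp_len >= eff_ref_len:
        return 1.0
    return math.exp(1.0 - eff_ref_len / hyp_len)

# The two rules diverge, e.g., for a 13-token hypothesis with references
# of length 10 and 14: NIST uses r = 10, the closest rule uses r = 14.
print(brevity_penalty(13, effective_ref_len_nist(13, [10, 14])))     # 1.0
print(brevity_penalty(13, effective_ref_len_closest(13, [10, 14])))  # ~0.926
```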
Background | The data set used for weight training in boosting-based system combination comes from the NIST MT03 evaluation set.
Background | The test sets are the NIST evaluation sets of MT04, MT05, and MT06.
Introduction | All the systems are evaluated on three NIST MT evaluation test sets. |
Experiments and Results | Additionally, the NIST score (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) are used to check the consistency of the experimental results.
Experiments and Results | BLEU 0.4029 / 0.3146; NIST 7.0419 / 8.8462; METEOR 0.5785 / 0.5335.
Experiments and Results | Both SMP and ESSP consistently outperform the baseline in BLEU, NIST, and METEOR.
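Experiments and Results | The snippet below shows one way to reproduce such a three-metric consistency check with NLTK; the paper does not name its implementations, so the library choice, the toy data, and the corpus-level averaging of METEOR (which NLTK provides only at sentence level) are our assumptions. METEOR additionally requires the WordNet data (nltk.download('wordnet')).

```python
# Hedged sketch: scoring a system with BLEU, NIST, and METEOR via NLTK.
# This is only one plausible setup, not the paper's actual tooling.
# Inputs are pre-tokenized, as required by recent NLTK versions.
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist
from nltk.translate.meteor_score import meteor_score

# Toy data: one hypothesis with two references, all as token lists.
hyps = [["the", "cat", "sat", "on", "the", "mat"]]
refs = [[["the", "cat", "sat", "on", "a", "mat"],
         ["a", "cat", "sat", "on", "the", "mat"]]]

print("BLEU  :", corpus_bleu(refs, hyps))
print("NIST  :", corpus_nist(refs, hyps, n=5))
# NLTK's METEOR is sentence-level; average over sentences for a corpus score.
print("METEOR:", sum(meteor_score(r, h) for r, h in zip(refs, hyps)) / len(hyps))
```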
Abstract | We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. |
Conclusion | Automatic evaluation will make testing easier during system development and enable the reporting of results obtained outside the NIST evaluation cycles.
Introduction | quality and none have been validated on data from NIST evaluations. |
Introduction | We evaluate the predictive power of these linguistic quality metrics by training and testing models on consecutive years of NIST evaluations (data described |
Results and discussion | In both DUC 2006 and DUC 2007, ten NIST assessors wrote summaries for the various inputs. |
Results and discussion | We only report results on the input level, as we are interested in distinguishing between the quality of the summaries, not the NIST assessors’ writing skills. |
Experiments | For the error detection task, we use the best translation hypotheses of NIST MT-02/05/03 generated by MOSES as our training, development, and test corpora, respectively.
SMT System | The translation task is on the official NIST Chinese-to-English evaluation data. |
SMT System | For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task. |
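SMT System | Och's (2003) MERT performs an exact line search along one feature direction at a time over fixed n-best lists; the sketch below substitutes a coarse grid search for that line search to keep the idea visible, and every name in it is illustrative rather than taken from the system described here.

```python
# Simplified sketch of minimum error rate training (Och, 2003): choose
# linear feature weights that maximize a corpus-level metric (e.g. BLEU)
# over fixed n-best lists. Real MERT solves each 1-D search exactly; a
# coarse grid search stands in for it here. All names are illustrative.

def rerank(nbest, weights):
    # nbest: one list per sentence of (hypothesis, feature_vector) pairs;
    # return the 1-best hypothesis per sentence under the given weights.
    return [max(cands,
                key=lambda hf: sum(w * f for w, f in zip(weights, hf[1])))[0]
            for cands in nbest]

def mert_coordinate_ascent(nbest, metric, dim, sweeps=5):
    # metric: maps a list of 1-best hypotheses to a score to maximize
    # (e.g. corpus BLEU against the development references).
    weights = [1.0] * dim
    for _ in range(sweeps):
        for d in range(dim):                              # one weight at a time
            best_w = weights[d]
            best_score = metric(rerank(nbest, weights))
            for w in (i / 5.0 - 2.0 for i in range(21)):  # grid on [-2, 2]
                trial = weights[:d] + [w] + weights[d + 1:]
                score = metric(rerank(nbest, trial))
                if score > best_score:
                    best_w, best_score = w, score
            weights[d] = best_w
    return weights
```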
SMT System | In order to calculate word posterior probabilities, we generate 10,000-best lists for NIST MT-02/03/05, respectively.
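SMT System | A common way to turn such n-best lists into word posterior probabilities, in the spirit of Ueffing and Ney (2007), is to normalize the hypotheses' model scores into a distribution over the list and sum the probability mass of the hypotheses containing each word; the position-independent variant below is a hedged sketch with names of our own choosing, not the system's actual code.

```python
# Hedged sketch: position-independent word posterior probabilities from an
# n-best list. Model scores are assumed to be log-scores; a scaling factor
# alpha sharpens or flattens the distribution. All names are illustrative.
import math
from collections import defaultdict

def word_posteriors(nbest, alpha=1.0):
    # nbest: list of (tokens, log_score) pairs for one source sentence.
    # Normalize hypothesis scores into a posterior over the n-best list.
    m = max(s for _, s in nbest)                  # shift for numerical stability
    probs = [math.exp(alpha * (s - m)) for _, s in nbest]
    z = sum(probs)
    posteriors = defaultdict(float)
    for (tokens, _), p in zip(nbest, probs):
        for w in set(tokens):                     # count each word once per hypothesis
            posteriors[w] += p / z
    return dict(posteriors)

# Toy example: "house" appears in one hypothesis, "home" in the other,
# so their posteriors split the normalized probability mass.
nbest = [("the house is small".split(), -2.0),
         ("the home is small".split(), -2.5)]
print(word_posteriors(nbest))
```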