The Three-way Decision Task | The answer key for the three-way decision task was developed at the National Institute of Standards and Technology (NIST) using annotators who had experience as TREC and DUC assessors. |
The Three-way Decision Task | NIST assessors annotated all 800 entailment pairs in the test set, with each pair independently annotated by two different assessors. |
The Three-way Decision Task | The three-way answer key was formed by keeping exactly the same set of YES answers as in the two-way key (regardless of the NIST annotations) and having NIST staff adjudicate assessor differences on the remainder. |
Metric Design Considerations | To evaluate our metric, we conduct experiments on datasets from the ACL-07 MT workshop and the NIST MT 2003 evaluation. |
Metric Design Considerations | Table 4: Correlations on the NIST MT 2003 dataset. |
Metric Design Considerations | 5.2 NIST MT 2003 Dataset |
Abstract | We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets. |
Conclusions | We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets. |
Experimental Results | We compile a parallel dataset which consists of various corpora distributed by the Linguistic Data Consortium (LDC) for NIST MT evaluation. |
Experimental Results | 4.5.2 BLEU on NIST MT Test Sets |
Experimental Results | Table 7 reports the results on various NIST MT test sets. |
Introduction | We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets. |
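Since BLEU is the headline metric throughout these experiments, a minimal sketch of how it is computed may help. This is a simplified sentence-level version with a single reference; the function names are illustrative, and real evaluations (e.g., NIST's mteval script) aggregate clipped counts at the corpus level and support multiple references.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: modified (clipped) n-gram
    precisions combined geometrically, times a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:
        return 0.0
    # brevity penalty punishes candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference scores 1.0, while any n-gram mismatch or length difference pulls the score below 1.0.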
Conclusions and Future Work | Our string-to-dependency system generates 80% fewer rules, and achieves a 1.48-point improvement in BLEU and a 2.53-point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set. |
Experiments | We used part of the NIST 2006 Chinese-English large track data as well as some LDC corpora collected for the DARPA GALE program (LDC2005E83, LDC2006E34 and LDC2006G05) as our bilingual training data. |
Experiments | We tuned the weights on NIST MT05 and tested on MT04. |
Introduction | For example, Chiang (2007) showed that the Hiero system achieved improvements of about 1 to 3 BLEU points on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system. |
Introduction | Our string-to-dependency decoder shows a 1.48-point improvement in BLEU and a 2.53-point improvement in TER on the NIST 04 Chinese-English MT evaluation set. |
Experiments | Results were evaluated with both BLEU (Papineni et al., 2001) and NIST metrics (NIST, 2002). |
Experiments | Scores by test set: baseline BLEU 18.13 (devtest) / 18.05 (test07), NIST 5.218 / 5.279; person BLEU 18.16 / 18.17, NIST 5.224 / 5.316 |
Experiments | The NIST metric clearly shows a significant improvement, because it mostly measures difficult n-gram matches (e.g., due to the long-distance rules we have been dealing with). |
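The reason the NIST metric rewards "difficult" matches is that it weights each matched n-gram by its information content, estimated from reference data, so rare n-grams count for more than common ones. The following is a simplified, single-reference sketch of that idea (function names are illustrative; the official mteval script additionally applies a brevity factor and corpus-level aggregation).

```python
from collections import Counter
import math

def nist_info(ref_corpus, max_n=5):
    """Estimate NIST information weights from a reference corpus:
    Info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)).
    Rare, hard-to-match n-grams therefore carry more weight."""
    counts = Counter()
    for sent in ref_corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    total_unigrams = sum(len(s) for s in ref_corpus)

    def info(gram):
        prefix = total_unigrams if len(gram) == 1 else counts[gram[:-1]]
        return math.log2(prefix / counts[gram]) if counts[gram] else 0.0

    return info

def nist_score(candidate, reference, info, max_n=5):
    """Simplified sentence-level NIST score (no brevity factor):
    information-weighted n-gram matches, averaged per n-gram order."""
    score = 0.0
    for n in range(1, max_n + 1):
        ref_grams = Counter(tuple(reference[i:i + n])
                            for i in range(len(reference) - n + 1))
        matched, count = 0.0, 0
        for i in range(len(candidate) - n + 1):
            g = tuple(candidate[i:i + n])
            count += 1
            if ref_grams[g] > 0:          # clip matches by reference counts
                ref_grams[g] -= 1
                matched += info(g)
        if count:
            score += matched / count
    return score
```

On a toy corpus, a candidate that reproduces the reference exactly scores higher than one that only matches common words, which is the behavior the sentence above appeals to.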
Experiments | We run an improved version of our 2006 NIST MT Evaluation entry for the Arabic-English “Unlimited” data track. The language model is the same one as in the previous section. |
Experiments | We use MT04 data for system development, with MT05 data and MT06 (“NIST” subset) data for blind testing. |
Experiments | Overall, our baseline results compare favorably to those reported on the NIST MT06 web site. |
Abstract | Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems. |
Conclusions and Future Work | The experimental results on the NIST MT-2005 Chinese-English translation task demonstrate the effectiveness of the proposed model. |
Experiments | We used sentences with fewer than 50 characters from the NIST MT-2002 test set as our development set and the NIST MT-2005 test set as our test set. |
Introduction | Experimental results on the NIST MT-2005 Chinese-English translation task show that our method significantly outperforms Moses (Koehn et al., 2007), a state-of-the-art phrase-based SMT system, and other linguistically syntax-based methods, such as SCFG-based and STSG-based methods (Zhang et al., 2007). |