Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Introduction | We show primary results on the NIST OpenMT12 Arabic-English condition. |
Introduction | We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions. |
Model Variations | For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr tokenizer for the NIST condition.
Model Variations | We present primary MT results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions.
Model Variations | 6.1 NIST OpenMT12 Results |
Abstract | Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline. |
Experiments | The NIST 2003 dataset is the development data. |
Experiments | The testing data consists of the NIST 2004, 2005, 2006, and 2008 datasets.
Introduction | We integrate topic similarity features in the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. |
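Introduction | As background for how such features enter the system: in the standard log-linear formulation of SMT, an added feature is simply one more weighted term in the model score, with its weight tuned alongside the existing ones (e.g., by MERT). A generic sketch, with the topic-similarity feature name chosen here purely for illustration:

```latex
\hat{e} = \arg\max_{e} \Big[ \sum_{i} \lambda_i\, h_i(f, e) \;+\; \lambda_{\mathrm{topic}}\, h_{\mathrm{topic}}(f, e) \Big]
```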
Experimental Results | Group III contains other important evaluation metrics that were not considered in the WMT12 metrics task: NIST and ROUGE at both system and segment level, and BLEU and TER at segment level.
Experimental Results | [Table fragment: NIST correlations .817, .842, .875 (system-level); NIST .214, .172, .206 and ROUGE .185, .144, .201 (segment-level).]
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
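Experimental Setup | As a rough illustration of computing these lexical metrics, the sketch below scores a toy hypothesis with NLTK's BLEU and NIST implementations; the tokenization and data are invented for the example and are not the setup used in the paper.

```python
# Minimal metric-scoring sketch using NLTK (assumes `pip install nltk`).
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist

# Each hypothesis is paired with a list of reference token lists.
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]

print("BLEU-4:", corpus_bleu(references, hypotheses))       # 4-gram precision with brevity penalty
print("NIST-5:", corpus_nist(references, hypotheses, n=5))  # information-weighted n-grams (Doddington, 2002)
```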
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
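Experimental Setup | A minimal sketch of one way to realize such a combination, assuming a uniform average over min-max-normalized scores; the exact normalization in the paper may differ, and the toy scores below are invented. TERp-A is negated so that higher is better for every metric.

```python
def minmax(scores):
    # Rescale a metric's per-system scores to [0, 1] so metrics on
    # different scales (e.g., BLEU vs. NIST) can be averaged.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def combine(metric_scores):
    """metric_scores: dict mapping metric name -> list of per-system scores."""
    cols = [minmax(v) for v in metric_scores.values()]
    n_systems = len(next(iter(metric_scores.values())))
    return [sum(col[i] for col in cols) / len(cols) for i in range(n_systems)]

scores = {"BLEU": [0.31, 0.28], "NIST": [8.9, 8.4], "METEOR-ex": [0.52, 0.49],
          "ROUGE-W": [0.41, 0.38], "TERp-A": [-0.45, -0.48]}  # TERp-A negated
print(combine(scores))  # one combined score per system
```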
Related Work | The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012) and the NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtains statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results | Our development set is the NIST 2005 MT evaluation set (1084 sentences), and our test set is the NIST 2006 MT evaluation set (1664 sentences).
Experiments and Results | Adding the new DNN features as extra features significantly improves translation accuracy (rows 2-17 vs. row 1), with the highest increases of 2.45 (IWSLT) and 1.52 (NIST) BLEU points over the baseline features (row 14 vs. row 1).
Introduction | Finally, we conduct large-scale experiments on the IWSLT and NIST Chinese-English translation tasks, and the results demonstrate that our solutions successfully address the two aforementioned shortcomings.
Conclusion | The sense-based translation model substantially improves translation quality in terms of both BLEU and NIST.
Experiments | We used the NIST MT03 evaluation test data as our development set, and the NIST MT05 as the test set. |
Experiments | We evaluated translation quality with case-insensitive BLEU-4 (Papineni et al., 2002) and NIST (Doddington, 2002).
Experiments | Table: STM (i5w): BLEU 34.64%, NIST 9.4346; STM (i10w): BLEU 34.76%, NIST 9.5114; STM (i15w): not reported.
Experiments | Dataset and SMT Pipeline We use the NIST MT Chinese-English parallel corpus (NIST), excluding non-UN and non-HK Hansards portions, as our training dataset.
Experiments | To optimize the SMT system, we tune the parameters on NIST MT06 and report results on three test sets: MT02, MT03, and MT05.
Experiments | Resources for Prior Tree: To build the tree for tLDA and ptLDA, we extract word correlations from a Chinese-English bilingual dictionary (Denisowski, 1997). We filter the dictionary using the NIST vocabulary, keeping only entries that map a single Chinese word to a single English word.
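Experiments | A hypothetical sketch of this filtering step; the tab-separated dictionary format and function name are assumptions for illustration, not the paper's actual tooling.

```python
def load_single_word_pairs(dict_path, zh_vocab, en_vocab):
    """Keep dictionary entries mapping one Chinese word to one English word,
    restricted to words present in the corpus vocabularies."""
    pairs = []
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            zh, en = fields[0], fields[1]
            # single-word entries only, and both sides must be in-vocabulary
            if " " not in zh and " " not in en and zh in zh_vocab and en in en_vocab:
                pairs.append((zh, en))
    return pairs
```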
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
Experiments | The NIST evaluation campaign data sets MT-03 and MT-05 are selected as the MT development data (devMT) and testing data (testMT), respectively.
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
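Abstract | To illustrate the general mechanism only (not the paper's actual rule set), the sketch below applies a single toy dependency-based pre-ordering rule: children bearing a chosen relation are emitted before their head when the tree is linearized.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    word: str
    deprel: str                       # dependency relation to the head
    children: List["Node"] = field(default_factory=list)

def linearize(node: Node, preordered: Tuple[str, ...] = ("prep",)) -> List[str]:
    """Emit children whose relation is in `preordered` before the head,
    and all other children after it. The single rule here is illustrative."""
    out: List[str] = []
    for c in node.children:
        if c.deprel in preordered:
            out.extend(linearize(c, preordered))
    out.append(node.word)
    for c in node.children:
        if c.deprel not in preordered:
            out.extend(linearize(c, preordered))
    return out
```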
Experiments | Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentence pairs.
Experiments | Our test set was the NIST 2006 MT evaluation data, consisting of 1664 sentence pairs. |
Introduction | Experimental results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | We used the newswire portion of the NIST MT06 evaluation data as our development set, and used the evaluation data of MT04 and MT05 as our test sets.
Introduction | We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Experimental Setup | We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data. |
Experimental Setup | We tune on the NIST 2004 evaluation set (1353 sentences) and evaluate on NIST 2005 (1056 sentences). |
Results | Judging from the output on the NIST 2005 test set, the system uses these discontiguous desegmentations very rarely: only 5% of desegmented tokens align to discontiguous source phrases. |
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | The first bilingual corpus, OpenMT06, was used in the NIST Open Machine Translation 2006 Evaluation.
Complexity Analysis | The NIST evaluation data sets from 2002 to 2005 were used as the development data for MERT tuning (Och, 2003).
Introduction | The BABEL task is modeled on the 2006 NIST Spoken Term Detection evaluation (NIST, 2006) but focuses on limited-resource conditions.
Results | At our disposal, we have the five BABEL languages — Tagalog, Cantonese, Pashto, Turkish and Vietnamese — as well as the development data from the NIST 2006 English evaluation. |
Term Detection Re-scoring | The primary metric for the BABEL program, Actual Term Weighted Value (ATWV), is defined by NIST using a cost function of the false alarm probability P(FA) and the miss probability P(Miss), averaged over a set of queries (NIST, 2006).
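Term Detection Re-scoring | A sketch of that cost function under the standard NIST STD 2006 setting, where beta is approximately 999.9 (derived from the evaluation's cost/value ratio and term prior); the per-query statistics format here is an assumption for illustration.

```python
def atwv(query_stats, beta=999.9):
    """query_stats: list of per-query tuples
    (n_correct, n_true, n_false_alarm, n_nontarget_trials).
    TWV = 1 - mean over queries of [P(Miss) + beta * P(FA)]."""
    total = 0.0
    for n_corr, n_true, n_fa, n_trials in query_stats:
        p_miss = 1.0 - (n_corr / n_true) if n_true else 0.0
        p_fa = (n_fa / n_trials) if n_trials else 0.0
        total += 1.0 - (p_miss + beta * p_fa)
    return total / len(query_stats)
```

The number of non-target trials per query is conventionally taken to be the audio duration in seconds minus the number of true occurrences.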