Analysis and Discussion | Test set (NIST): CE_LD ’06 ’08; CE_SD ’06 ’08
Analysis and Discussion | Table 4: Results (BLEU%) of Chinese-to-English large data (CE_LD) and small data (CE_SD) NIST task by applying one feature.
Analysis and Discussion | Table 6: Results (BLEU%) of using simple features based on context on small data NIST task. |
Experiments | The first one is the large data condition, based on training data for the NIST 2009 evaluation Chinese-to-English track.
Experiments | We first created a development set which used mainly data from the NIST 2005 test set, and also some balanced-genre web-text from the NIST training material. |
Experiments | Evaluation was performed on the NIST 2006 and 2008 test sets. |
Abstract | Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
Experiments | We conduct our experiments on the test data from the NIST 2005 and NIST 2008 Chinese-to-English machine translation tasks. |
Experiments | The NIST 2003 test data is used for development data to estimate model parameters. |
Experiments | In our experiments all the models are optimized with case-insensitive NIST version of BLEU score and we report results using this metric in percentage numbers. |
Introduction | We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems. |
The Three-way Decision Task | The answer key for the three-way decision task was developed at the National Institute of Standards and Technology (NIST) using annotators who had experience as TREC and DUC assessors.
The Three-way Decision Task | NIST assessors annotated all 800 entailment pairs in the test set, with each pair independently annotated by two different assessors. |
The Three-way Decision Task | The three-way answer key was formed by keeping exactly the same set of YES answers as in the two-way key (regardless of the NIST annotations) and having NIST staff adjudicate assessor differences on the remainder. |
Abstract | On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Introduction | We show primary results on the NIST OpenMT12 Arabic-English condition. |
Introduction | We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions. |
Model Variations | For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr9 tokenizer for the NIST condition. |
Model Variations | We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions. |
Model Variations | 6.1 NIST OpenMT12 Results |
Abstract | We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
Expt. 1: Predicting Absolute Scores | Our first experiment evaluates the models we have proposed on a corpus with traditional annotation on a seven-point scale, namely the NIST OpenMT 2008 corpus.4 The corpus contains translations of newswire text into English from three source languages (Arabic (Ar), Chinese (Ch), Urdu (Ur)).
Expt. 1: Predicting Absolute Scores | BLEUR, METEORR, and NISTR significantly predict one language each (all Arabic); TERR, MTR, and RTER predict two languages.
Experimental Evaluation | NISTR consists of 16 features. |
Experimental Evaluation | NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
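Assuming the component scores are already computed, the 16-dimensional NISTR vector described above can be assembled as follows (a sketch; the feature ordering and the choice of NIST-5 as "the NIST score" are assumptions, not taken from the paper):

```python
def nistr_features(nist_n, info_precisions, bp):
    """Assemble the 16 NISTR features described in the text:
    10 NIST-n scores (n = 1..10), 4 information-weighted n-gram
    precision scores (n = 1..4), the NIST brevity penalty (BP),
    and the NIST score divided by BP (NIST-5 assumed here)."""
    assert len(nist_n) == 10 and len(info_precisions) == 4
    return list(nist_n) + list(info_precisions) + [bp, nist_n[4] / bp]
```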
Expt. 2: Predicting Pairwise Preferences | 1: Among individual metrics, METEORR and TERR do better than BLEUR and NISTR.
Expt. 2: Predicting Pairwise Preferences | NISTR 50.2 70.4 |
Expt. 2: Predicting Pairwise Preferences | Again, we see better results for METEORR and TERR than for BLEUR and NISTR, and the individual metrics do worse than the combination models.
Introduction | Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
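The n-gram overlap at the heart of both metrics can be illustrated with a clipped n-gram precision (a minimal sketch of the shared idea, not the full scoring formula of either BLEU or NIST):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, references, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    as often as it appears in the most generous reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())
```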
Abstract | Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline. |
Experiments | The NIST 2003 dataset is the development data. |
Experiments | The testing data consists of NIST 2004, 2005, 2006 and 2008 datasets. |
Experiments | NIST 2004 |
Introduction | We integrate topic similarity features in the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. |
Abstract | Experimental results show that our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-of-the-art baseline.
Conclusion and Future Work | We conduct experiments on IWSLT and NIST data, and our method can improve the performance significantly. |
Experiments and Results | We test our method with two data settings: one is the IWSLT data set, the other is the NIST data set.
Experiments and Results | For the NIST data set, the bilingual training data we used is NIST 2008 training set excluding the Hong Kong Law and Hong Kong Hansard. |
Experiments and Results | The baseline results on NIST data are shown in Table 2. |
Introduction | We conduct experiments with IWSLT and NIST data, and experimental results show that our method
Discussion | Table 5: MBR Parameter Tuning on NIST systems |
Experiments | The first one is the constrained data track of the NIST Arabic-to-English (aren) and Chinese-to-English (zhen) translation task.
Experiments | Table 1: Statistics over the NIST dev/test sets. |
Experiments | Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters. |
Experimental Results | Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level. |
Experimental Results | NIST .817 .842 .875 |
Experimental Results | NIST .214 .172 .206; ROUGE .185 .144 .201
Experimental Setup | To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms). |
Experimental Setup | Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work | The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others. |
Experiments | To demonstrate the effect of the ℓ0-norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task.
Experiments | • Chinese-English: selected data from the constrained task of the NIST 2009 Open MT Evaluation.3
Experiments | • Arabic-English: all available data for the constrained track of NIST 2009, excluding United Nations proceedings (LDC2004E13), ISI Automatically Extracted Parallel Text (LDC2007E08), and Ummah newswire text (LDC2004T18), for a total of 5.4+4.3 million words.
Abstract | The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems. |
Background | 2 In this paper, we use the NIST definition of BLEU where the effective reference length is the length of the shortest reference translation. |
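A small sketch of how this definition changes the brevity penalty (the "closest" alternative is shown only for contrast; function names are illustrative):

```python
import math

def brevity_penalty(hyp_len, ref_lens, effective="shortest"):
    """BLEU brevity penalty. Under the NIST definition cited in the text,
    the effective reference length is the length of the shortest reference;
    the 'closest' rule picks the reference length nearest the hypothesis."""
    if effective == "shortest":
        r = min(ref_lens)
    else:
        r = min(ref_lens, key=lambda rl: (abs(rl - hyp_len), rl))
    if hyp_len >= r:
        return 1.0
    return math.exp(1.0 - r / hyp_len)
```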
Background | The data set used for weight training in boosting-based system combination comes from the NIST MT03 evaluation set.
Background | The test sets are the NIST evaluation sets of MT04, MT05 and MT06.
Introduction | All the systems are evaluated on three NIST MT evaluation test sets. |
Metric Design Considerations | To evaluate our metric, we conduct experiments on datasets from the ACL-07 MT workshop and NIST |
Metric Design Considerations | Table 4: Correlations on the NIST MT 2003 dataset. |
Metric Design Considerations | 5.2 NIST MT 2003 Dataset |
Experiments | (2) The NIST task is Chinese-to-English translation with OpenMT08 training data and MT06 as devset. |
Experiments | (Task, Train, Devset, #Feat, Metrics): PubMed 0.2M 2k 14 BLEU, RIBES; NIST 7M 1.6k 8 BLEU, NTER
Experiments | Our MT models are trained with standard phrase-based Moses software (Koehn et al., 2007), with IBM M4 alignments, 4-gram SRILM, lexical ordering for PubMed and distance ordering for the NIST system.
Introduction | Experiments on NIST Chinese-English and PubMed English-Japanese translation using BLEU, TER, and RIBES are presented in Section 4. |
Experiments | We chose to use this data set, rather than more standard NIST test sets to ensure that we had recent documents in the test set (the most recent NIST test sets contain documents published in 2007, well before our microblog data was created). |
Experiments | For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. |
Experiments | FBIS 9.4 18.6 10.4 12.3; NIST 11.5 21.2 11.4 13.9; Weibo 8.75 15.9 15.7 17.2
Parallel Data Extraction | Likewise, for the EN-AR language pair, we use a fraction of the NIST dataset, removing the data originating from the UN, which leads to approximately 1M sentence pairs.
Alternatives to Correlation-based Meta-evaluation | NIST 5.70; randOST 5.20; minOST 3.67
Metrics and Test Beds | At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Metrics and Test Beds | Table 1: NIST 2004/2005 MT Evaluation Campaigns. |
Metrics and Test Beds | We use the test beds from the 2004 and 2005 NIST MT Evaluation Campaigns (Le and Przybocki, 2005).
Previous Work on Machine Translation Meta-Evaluation | With the aim of overcoming some of the deficiencies of BLEU, Doddington (2002) introduced the NIST metric. |
Previous Work on Machine Translation Meta-Evaluation | Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
Abstract | We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets. |
Conclusions | We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets. |
Experimental Results | We compile a parallel dataset which consists of various corpora distributed by the Linguistic Data Consortium (LDC) for NIST MT evaluation. |
Experimental Results | 4.5.2 BLEU on NIST MT Test Sets |
Experimental Results | Table 7 reports the results on various NIST MT test sets. |
Introduction | We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets. |
Experiments and Results | Additionally, NIST score (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) are also used to check the consistency of experimental results.
Experiments and Results | BLEU 0.4029 0.3146; NIST 7.0419 8.8462; METEOR 0.5785 0.5335
Experiments and Results | Both SMP and ESSP consistently outperform the baseline in BLEU, NIST and METEOR.
Abstract | We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results. |
Conclusion | Automatic evaluation will make testing easier during system development and enable reporting results obtained outside of the cycles of NIST evaluation. |
Introduction | quality and none have been validated on data from NIST evaluations. |
Introduction | We evaluate the predictive power of these linguistic quality metrics by training and testing models on consecutive years of NIST evaluations (data described |
Results and discussion | In both DUC 2006 and DUC 2007, ten NIST assessors wrote summaries for the various inputs. |
Results and discussion | We only report results on the input level, as we are interested in distinguishing between the quality of the summaries, not the NIST assessors’ writing skills. |
Experimental Setup | We trained the system on the NIST MT06 Eval corpus excluding the UN data (approximately 900K sentence pairs). |
Experimental Setup | We used the NIST MT03 test set as the development set for optimizing interpolation weights using minimum error rate training (MERT; (Och and Ney, 2002)). |
Experimental Setup | We carried out evaluation of the systems on the NIST 2006 evaluation test (MT06) and the NIST 2008 evaluation test (MT08). |
Experiments | NIST, sentence-level n-gram overlap weighted in favour of less frequent n-grams, as in (Belz et al., 2011)
Experiments | score for the REG→LIN system comes close to the upper bound that applies linearization on linSynJflae, gold shallow trees with gold REs (BLEUT of 72.4), whereas the difference in standard BLEU and NIST is high.
Experiments | Input System BLEU NIST BLEUT |
Conclusion | o The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST . |
Experiments | We used the NIST MT03 evaluation test data as our development set, and the NIST MT05 as the test set. |
Experiments | We evaluated translation quality with the case-insensitive BLEU-4 (Papineni et al., 2002) and NIST (Doddington, 2002). |
Experiments | System, BLEU(%), NIST: STM (i5w) 34.64 9.4346; STM (i10w) 34.76 9.5114; STM (i15w) - -
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST . |
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 (TER) and 0.08 (NIST).
Incremental Topic-Based Adaptation | REFERENCE TRANSCRIPTIONS: SYSTEM, BLEU↑, TER↓, NIST↑
Incremental Topic-Based Adaptation | SYSTEM, BLEU↑, TER↓, NIST↑
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task. |
Abstract | Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. |
Experiments | We carried out experiments in two different settings, both involving data from NIST Open MT 2012.2 The first setting is based on data from the Chinese to English constrained track, comprising about 283 million English running words. |
Experiments | The development set (tune) was taken from the NIST 2005 evaluation set, augmented with some web-genre material reserved from other NIST corpora. |
Experiments | Table 2: NIST Arabic-English data. |
Vector space model adaptation | Table 1: NIST Chinese-English data. |
Abstract | On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions | The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together yields statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results | Our development set is NIST 2005 MT evaluation set (1084 sentences), and our test set is NIST 2006 MT evaluation set (1664 sentences). |
Experiments and Results | Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features.
Introduction | Finally, we conduct large-scale experiments on IWSLT and NIST Chinese-English translation tasks, respectively, and the results demonstrate that our solutions solve the two aforementioned shortcomings successfully. |
Experiments | The second setting uses the non-UN and non-HK Hansards portions of the NIST training corpora with LTM only. |
Experiments | En Zh: FBIS 269K 10.3M 7.9M; NIST 1.6M 44.4M 40.4M
Experiments | 2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012). |
Experiments | We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality. |
Experiments | The NIST evaluation campaign data, MT-03 and MT-05, are selected to comprise the MT development data, devMT, and testing data, testMT, respectively.
Experiments | NIST and METEOR over others. |
Experiments | Dataset and SMT Pipeline We use the NIST MT Chinese-English parallel corpus (NIST), restricted to the non-UN and non-HK Hansards portions, as our training dataset.
Experiments | To optimize the SMT system, we tune the parameters on NIST MT06, and report results on three test sets: MT02, MT03 and MT05.2
Experiments | Resources for Prior Tree To build the tree for tLDA and ptLDA, we extract the word correlations from a Chinese-English bilingual dictionary (Denisowski, 1997).4 We filter the dictionary using the NIST vocabulary, and keep entries mapping single Chinese and single English words. |
Abstract | Our experiments on NIST 2008 testing data with automatic evaluation as well as human judgments suggest that the proposed method is able to enhance the paraphrase quality by adjusting between semantic equivalency and surface dissimilarity. |
Experiments and Results | We use 2003 NIST Open Machine Translation Evaluation data (NIST 2003) as development data (containing 919 sentences) for MERT and test the performance on NIST 2008 data set (containing 1357 sentences). |
Experiments and Results | NIST Chinese-to-English evaluation data offers four English human translations for every Chinese sentence. |
Experiments and Results | Table 1: iBLEU Score Results (NIST 2008)
Introduction | We test our method on NIST 2008 testing data. |
Conclusions and Future Work | Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set. |
Experiments | We used part of the NIST 2006 Chinese-English large track data as well as some LDC corpora collected for the DARPA GALE program (LDC2005E83, LDC2006E34 and LDC2006G05) as our bilingual training data. |
Experiments | We tuned the weights on NIST MT05 and tested on MT04. |
Introduction | For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction | Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set. |
Abstract | Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets. |
Introduction | competitive phrase-based systems in large-scale experiments such as NIST evaluations.2 This lack of significant difference may not be completely surprising.
Introduction | 2Results of the 2008 NIST Open MT evaluation (http://www.itl.nist.gov/iad/mig/tests/mt/2008/doc/mt08_official_results_v0.html) reveal that, while many of the best systems in the Chinese-English and Arabic-English tasks incorporate synchronous CFG models, score differences with the best phrase-based system were insignificantly small.
Machine translation experiments | For tuning and testing, we use the official NIST MT evaluation data for Chinese from 2002 to 2008 (MT02 to MT08), which all have four English references for each input sentence. |
Machine translation experiments | Table 6 provides experimental results on the NIST test data (excluding the tuning set MT05) for each of the three genres: newswire, web data, and speech (broadcast news and conversation).
Experiments | Table 1: Inter-judge Kappa for the NIST 2008 English-Chinese task
Experiments | 4.2 NIST 2008 English-Chinese MT Task |
Experiments | The NIST 2008 English-Chinese MT task consists of 127 documents with 1,830 segments, each with four reference translations and eleven automatic MT system translations. |
Introduction | The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1 - TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
Abstract | On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement. |
Experiments | As for the blind test set, we report the performance on the NIST MT08 evaluation set, which consists of 691 sentences from newswire and 666 sentences from weblog. |
Experiments | Table 4 summarizes the experimental results on NIST MT08 newswire and weblog. |
Experiments | Table 4: The NIST MT08 results on newswire (nw) and weblog (wb) genres. |
Experiments | Results were evaluated with both BLEU (Papineni et al., 2001) and NIST metrics (NIST, 2002).
Experiments | set (BLEU devtest, BLEU test07, NIST devtest, NIST test07): baseline 18.13 18.05 5.218 5.279; person 18.16 18.17 5.224 5.316
Experiments | The NIST metric clearly shows a significant improvement, because it mostly measures difficult n-gram matches (e.g. due to the long-distance rules we have been dealing with).
Abstract | We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data. |
Conclusion and Future Work | The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data. |
Evaluation | We used the newswire portion of the NIST MT06 evaluation data as our development set, and used the evaluation data of MT04 and MT05 as our test sets.
Introduction | • We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
Experiments | We run an improved version of our 2006 NIST MT Evaluation entry for the Arabic-English “Unlimited” data track.6 The language model is the same one as in the previous section. |
Experiments | We use MT04 data for system development, with MT05 data and MT06 (“NIST” subset) data for blind testing.
Experiments | Overall, our baseline results compare favorably to those reported on the NIST MT06 web site. |
Abstract | Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems. |
Conclusions and Future Work | The experimental results on the NIST MT-2005 Chinese-English translation task demonstrate the effectiveness of the proposed model. |
Experiments | We used sentences with less than 50 characters from the NIST MT-2002 test set as our development set and the NIST MT-2005 test set as our test set. |
Introduction | Experiment results on the NIST MT-2005 Chinese-English translation task show that our method significantly outperforms Moses (Koehn et al., 2007), a state-of-the-art phrase-based SMT system, and other linguistically syntax-based methods, such as SCFG-based and STSG-based methods (Zhang et al., 2007). |
Abstract | Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems. |
Conclusion | Finally, we examine our methods on the FBIS corpus and the NIST MT-2003 Chinese-English translation task. |
Experiment | We use the FBIS corpus as the training set, the NIST MT-2002 test set as the development (dev) set and the NIST MT-2003 test set as the test set.
Introduction | We evaluate our method on the NIST MT-2003 Chinese-English translation tasks. |
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Experiments | Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentences pairs. |
Experiments | Our test set was the NIST 2006 MT evaluation data, consisting of 1664 sentence pairs. |
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Experiments | For the error detection task, we use the best translation hypotheses of NIST MT-02/05/03 generated by MOSES as our training, development, and test corpus respectively. |
SMT System | The translation task is on the official NIST Chinese-to-English evaluation data. |
SMT System | For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task. |
SMT System | In order to calculate word posterior probabilities, we generate 10,000 best lists for NIST MT-02/03/05 respectively. |
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Experimental setup | We use about 10K sentences (180K words) of manual word alignments which were created in house using part of the NIST MT-08 training data3 to train our baseline reordering model and to train our supervised machine aligners.
Experimental setup | We use a parallel corpus of 3.9M words consisting of 1.7M words from the NIST MT-08 training data set and 2.2M words extracted from parallel news stories on the
Experimental setup | We report results on the (four reference) NIST MT-08 evaluation set in Table 4 for the News and Web conditions.
Experiments | The dev set comprised mainly data from the NIST 2005 test set, and also some balanced-genre web-text from NIST.
Experiments | Evaluation was performed on NIST 2006 and 2008. |
Experiments | Table 10: Ordering scores (p, I and v) for test sets NIST |
Introduction | • BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
Abstract | Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang’s model with average gains of 1.91 points absolute in BLEU.
Experiments | We train our model on a dataset with ~1.5M sentence pairs from the LDC dataset.2 We use the 2002 NIST MT evaluation test data (878 sentence pairs) as the development data, and the 2003, 2004, 2005, 2006-news NIST MT evaluation test data (919, 1788, 1082, and 616 sentence pairs, respectively) as the test data. |
Experiments | For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores. |
Introduction | Experiments on Chinese-English translation using four NIST MT test sets show that our HD-HPB model significantly outperforms Chiang’s HPB as well as a SAMT-style refined version of HPB.
Abstract | We show that our model significantly improves the translation performance over the baseline on NIST Chinese-to-English translation experiments. |
Experiments | We present our experiments on the NIST Chinese-English translation tasks. |
Experiments | We used the NIST evaluation set of 2005 (MT05) as our development set, and sets of MT06/MT08 as test sets. |
Experiments | Case-insensitive NIST BLEU (Papineni et al., 2002) was used to measure translation quality.
Analysis | The mass is concentrated along the diagonal, probably because MT05/6/8 was prepared by NIST , an American agency, while the bitext was collected from many sources including Agence France Presse. |
Experiments | 4.3 NIST OpenMT Experiment |
Experiments | However, the bitext5k models do not generalize as well to the NIST evaluation sets as represented by the MT04 result. |
Introduction | The first experiment uses standard tuning and test sets from the NIST OpenMT competitions. |
Abstract | Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). |
Experiments | For English-to-Chinese translation, we used all the allowed training sets in the NIST 2008 constrained track. |
Experiments | For NIST , we filtered out sentences exceeding 80 words in the parallel texts. |
Experiments | Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively. |
Experiments | For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003 |
Experiments | We evaluate on the NIST test sets from 2003 and 2005, and the 2002 test set was used for MERT training. |
Additional Experiments | For training, we used the non-UN portion of the NIST training corpora, which was segmented using an HMM segmenter (Lee et al., 2003). |
Experiments | For training we used the non-UN and non-HK Hansards portions of the NIST training corpora, which was segmented using the Stanford segmenter (Tseng et al., 2005). |
Experiments | We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus. |
Experiments | In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics. |
Experiments | The average NIST simple string accuracy score reflects the average number of insertion (I), deletion (D), and substitution (S) errors between the output sentence and the reference sentence.
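The I/D/S formulation above can be sketched in code. This is a hypothetical implementation assuming SSA = 1 - (I + D + S) / |reference|, with the error counts obtained from a word-level Levenshtein alignment; the naming convention for insertions versus deletions varies between toolkits.

```python
# Hedged sketch of NIST simple string accuracy (SSA), assuming
# SSA = 1 - (I + D + S) / reference_length, with I/D/S counted via a
# word-level Levenshtein alignment between hypothesis and reference.

def edit_ops(hyp, ref):
    """Count (insertions, deletions, substitutions) turning hyp into ref."""
    m, n = len(hyp), len(ref)
    # dist[i][j] = min edits to turn hyp[:i] into ref[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # extra word in hyp
                             dist[i][j - 1] + 1,         # missing word in hyp
                             dist[i - 1][j - 1] + cost)  # substitution/match
    # Backtrace to split the total edit count into I, D, S
    i, j, ins, dele, sub = m, n, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                dist[i][j] == dist[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            sub += hyp[i - 1] != ref[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub

def simple_string_accuracy(hyp, ref):
    ins, dele, sub = edit_ops(hyp.split(), ref.split())
    return 1.0 - (ins + dele + sub) / max(len(ref.split()), 1)
```

For example, a hypothesis differing from the reference in one of three words scores 1 - 1/3 ≈ 0.667.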
Log-linear Models | The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
Abstract | As our approach combines the merits of phrase-based and string-to-dependency models, it achieves significant improvements over the two baselines on the NIST Chinese-English datasets. |
Introduction | We evaluate our method on the NIST Chinese-English translation datasets. |
Introduction | We used the 2002 NIST MT Chinese-English dataset as the development set and the 2003-2005 NIST datasets as the test sets.
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | The lower part of Table 2 also reports the results of simulated dynamic ranking (using the NIST rankings as the initial order for the sort operation). |
Results | Per-metric values, sentence level (Median / Mean / Trimmed) and corpus level: BLEU 0.357 / 0.298 / 0.348, corpus 0.833; NIST 0.357 / 0.291 / 0.347, corpus 0.810; Meteor 0.429 / 0.348 / 0.393, corpus 0.714; TER 0.214 / 0.186 / 0.204, corpus 0.619; GTM 0.429 / 0.340 / 0.392, corpus 0.714.
Experimental Setup | NIST 13.01 / 12.95 / 12.69; Match 27.91 / 27.66 / 26.38; Ling.
Experimental Setup | Model scores: BLEU 0.764 / 0.759 / 0.747; NIST 13.18 / 13.14 / 13.01.
Experimental Setup | We use several standard measures: (a) exact match: how often the model selects the original corpus sentence; (b) BLEU: n-gram overlap between the top-ranked and the original sentence; (c) NIST: a modification of BLEU giving more weight to less frequent n-grams.
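The weighting of less frequent n-grams in the NIST metric comes from per-n-gram information weights (Doddington, 2002). The sketch below shows only that weighting step, not the full metric; the function name and the restriction to counts over the reference set are illustrative assumptions.

```python
# Hedged sketch of the information weights used by the NIST MT metric:
# Info(w1..wk) = log2(count(w1..w_{k-1}) / count(w1..wk)), so rarer
# n-grams receive higher weight. Counts here are taken over a set of
# reference sentences; for unigrams the numerator is the total word count.
from collections import Counter
from math import log2

def ngram_info_weights(references, n=2):
    counts = Counter()
    total_words = 0
    for ref in references:
        words = ref.split()
        total_words += len(words)
        for k in range(1, n + 1):
            for i in range(len(words) - k + 1):
                counts[tuple(words[i:i + k])] += 1
    info = {}
    for gram, c in counts.items():
        if len(gram) == 1:
            info[gram] = log2(total_words / c)
        else:
            info[gram] = log2(counts[gram[:-1]] / c)
    return info
```

A frequent unigram like "the" thus contributes less to the score than a rare content word, which is the behavior the sentence above describes.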
Experimental Setup | We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data. |
Experimental Setup | We tune on the NIST 2004 evaluation set (1353 sentences) and evaluate on NIST 2005 (1056 sentences). |
Results | Judging from the output on the NIST 2005 test set, the system uses these discontiguous desegmentations very rarely: only 5% of desegmented tokens align to discontiguous source phrases. |
Abstract | Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Complexity Analysis | The first bilingual corpus, OpenMT06, was used in the NIST Open Machine Translation 2006 Evaluation.
Complexity Analysis | The data sets of NIST Eval 2002 to 2005 were used as the development data for MERT tuning (Och, 2003).
Introduction | The BABEL task is modeled on the 2006 NIST Spoken Term Detection evaluation (NIST, 2006) but focuses on limited resource conditions.
Results | At our disposal, we have the five BABEL languages — Tagalog, Cantonese, Pashto, Turkish and Vietnamese — as well as the development data from the NIST 2006 English evaluation. |
Term Detection Re-scoring | The primary metric for the BABEL program, Actual Term Weighted Value (ATWV), is defined by NIST using a cost function of the false alarm probability P(FA) and the miss probability P(Miss), averaged over a set of queries (NIST, 2006).
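The cost-function form of ATWV can be sketched as follows. This is a hedged sketch of the common formulation, not the official scoring tool: the per-query tuple layout and the use of speech duration minus true occurrences as the false-alarm trial count are assumptions; beta = 999.9 is the constant used in the NIST 2006 evaluation.

```python
# Hedged sketch of Actual Term Weighted Value (ATWV):
# ATWV = 1 - mean over queries of [ P(Miss) + beta * P(FA) ],
# with P(Miss) = 1 - n_correct / n_true and
# P(FA) = n_false_alarm / (speech_seconds - n_true).

BETA = 999.9  # cost ratio used in the NIST 2006 STD evaluation

def atwv(queries, speech_seconds):
    """queries: list of (n_correct, n_false_alarm, n_true) per query term."""
    twv_sum = 0.0
    scored = 0
    for n_correct, n_fa, n_true in queries:
        if n_true == 0:
            continue  # terms absent from the reference are typically skipped
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (speech_seconds - n_true)
        twv_sum += 1.0 - (p_miss + BETA * p_fa)
        scored += 1
    return twv_sum / scored if scored else 0.0
```

The large beta makes false alarms far more expensive than misses, which is why re-scoring methods that trade a few extra detections for fewer false alarms can raise ATWV substantially.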
Experiments | For Chinese-to-English translation, we use the parallel data from NIST Open Machine Translation Evaluation tasks. |
Experiments | The NIST 2003 and 2005 test data are taken as the development and test sets, respectively.
Introduction | (2008) and achieved state-of-the-art results as reported in the NIST 2008 Open MT Evaluation workshop and the NTCIR-9 Chinese-to-English patent translation task (Goto et al., 2011; Ma and Matsoukas, 2011). |
Experiments | develop: NIST 2002 (878 sentences, 10 references), NIST 2005 (1,082, 4), NIST 2004 (1,788, 5); test: NIST 2006 (1,664, 4), NIST 2008 (1,357, 4).
Experiments | The system was tested using the Chinese-English MT evaluation sets of NIST 2004, NIST 2006 and NIST 2008. |
Experiments | For development, we used the Chinese-English MT evaluation sets of NIST 2002 and NIST 2005. |
Experiment Results | The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation. |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | We tuned the parameters on MT06 NIST test set of 1664 sentences and report the results of MT04, MT05 and MT08 unseen test sets. |