Experiments | We chose to use this data set rather than more standard NIST test sets to ensure that the test set contained recent documents (the most recent NIST test sets contain documents published in 2007, well before our microblog data was created).
Experiments | For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data. |
Experiments | FBIS   9.4   18.6  10.4  12.3
Experiments | NIST   11.5  21.2  11.4  13.9
Experiments | Weibo  8.75  15.9  15.7  17.2
Parallel Data Extraction | Likewise, for the EN-AR language pair, we use a subset of the NIST dataset, obtained by removing the data originating from the UN, which leaves approximately 1M sentence pairs.
Abstract | Experiments on large-scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic-to-English and +1.4 BLEU on Chinese-to-English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
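For reference, linear mixture model adaptation of the kind used as a baseline above is commonly formulated (e.g., by Foster and Kuhn, 2007) as a convex combination of component models trained on individual sub-corpora; the exact variant used in this paper may differ:

    p(\bar{e} \mid \bar{f}) = \sum_{d} \lambda_d \, p_d(\bar{e} \mid \bar{f}),
    \qquad \lambda_d \ge 0, \quad \sum_d \lambda_d = 1

where each p_d is the translation model estimated on domain d and the mixture weights \lambda_d are tuned to favor the sub-corpora closest to in-domain development data.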
Experiments | We carried out experiments in two different settings, both involving data from NIST Open MT 2012. The first setting is based on data from the Chinese-to-English constrained track, comprising about 283 million English running words.
Experiments | The development set (tune) was taken from the NIST 2005 evaluation set, augmented with some web-genre material reserved from other NIST corpora. |
Experiments | Table 2: NIST Arabic-English data. |
Vector space model adaptation | Table 1: NIST Chinese-English data. |
Abstract | On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST.
Experimental Setup and Results | Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results | In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 (TER) and 0.08 (NIST).
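These scores were presumably produced with the standard mteval and tercom scoring scripts; as a rough modern equivalent, here is a minimal sketch of corpus-level scoring with sacrebleu (BLEU, TER) and NLTK (NIST), using invented hypothesis/reference strings:

    # Minimal corpus-level scoring sketch: BLEU and TER via sacrebleu, NIST via NLTK.
    from sacrebleu.metrics import BLEU, TER
    from nltk.translate.nist_score import corpus_nist

    hyps = ["the cat sat on the mat", "he went to the market"]    # system output (toy)
    refs = ["the cat sat on the mat", "he walked to the market"]  # one reference per segment (toy)

    bleu = BLEU().corpus_score(hyps, [refs])  # sacrebleu takes a list of reference streams
    ter = TER().corpus_score(hyps, [refs])    # TER: lower is better

    # NLTK's NIST expects tokenized input: a list of references per hypothesis.
    nist = corpus_nist([[r.split()] for r in refs], [h.split() for h in hyps], n=5)

    print(f"BLEU {bleu.score:.2f}  TER {ter.score:.2f}  NIST {nist:.2f}")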
Incremental Topic-Based Adaptation | REFERENCE TRANSCRIPTIONS: SYSTEM  BLEU↑  TER↓  NIST↑
Incremental Topic-Based Adaptation | SYSTEM  BLEU↑  TER↓  NIST↑
Introduction | With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task. |
Experiments | NIST: sentence-level n-gram overlap weighted in favour of less frequent n-grams, as in (Belz et al., 2011)
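The weighting referred to here is Doddington's (2002) information weight, computed from reference n-gram counts so that rarer, more informative n-grams contribute more to the score:

    \mathrm{Info}(w_1 \dots w_n) = \log_2 \frac{\mathrm{count}(w_1 \dots w_{n-1})}{\mathrm{count}(w_1 \dots w_n)}

The NIST score then sums these weights over the matched n-grams, normalizes by the number of n-grams in the hypothesis, and applies a brevity penalty.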
Experiments | The score for the REG→LIN system comes close to the upper bound that applies linearization to gold shallow trees with gold REs (BLEUT of 72.4), whereas the difference in standard BLEU and NIST is large.
Experiments | Input System BLEU NIST BLEUT |
Analysis | The mass is concentrated along the diagonal, probably because MT05/6/8 was prepared by NIST, an American agency, while the bitext was collected from many sources including Agence France Presse.
Experiments | 4.3 NIST OpenMT Experiment |
Experiments | However, the bitext5k models do not generalize as well to the NIST evaluation sets as represented by the MT04 result. |
Introduction | The first experiment uses standard tuning and test sets from the NIST OpenMT competitions. |
Abstract | On the NIST MT08 set, our most advanced model yields improvements of around +2.0 BLEU and -1.0 TER.
Experiments | As for the blind test set, we report performance on the NIST MT08 evaluation set, which consists of 691 newswire sentences and 666 weblog sentences.
Experiments | Table 4 summarizes the experimental results on NIST MT08 newswire and weblog. |
Experiments | Table 4: The NIST MT08 results on newswire (nw) and weblog (wb) genres. |
Abstract | The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Experimental setup | We use about 10K sentences (180K words) of manual word alignments, created in-house using part of the NIST MT-08 training data, to train our baseline reordering model and our supervised machine aligners.
Experimental setup | We use a parallel corpus of 3.9M words consisting of 1.7M words from the NIST MT-08 training data set and 2.2M words extracted from parallel news stories on the
Experimental setup | We report results on the (four-reference) NIST MT-08 evaluation set in Table 4 for the News and Web conditions.
Evaluation methodology | In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003). |
Results | The lower part of Table 2 also reports the results of simulated dynamic ranking (using the NIST rankings as the initial order for the sort operation). |
Results | Metric   Sentence-level Median   Sentence-level Mean   Sentence-level Trimmed   Corpus level
Results | BLEU     0.357   0.298   0.348   0.833
Results | NIST     0.357   0.291   0.347   0.810
Results | Meteor   0.429   0.348   0.393   0.714
Results | TER      0.214   0.186   0.204   0.619
Results | GTM      0.429   0.340   0.392   0.714
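Summary statistics of this kind can be computed per system and then aggregated; the sketch below assumes sentence-level metric scores paired with human judgments, and the choice of Spearman correlation and a 10% trim are assumptions, not details taken from the paper:

    import numpy as np
    from scipy.stats import spearmanr, trim_mean

    # Toy per-system data: sentence-level metric scores vs. human judgments
    # (all values invented for illustration).
    systems = [
        (np.array([0.31, 0.42, 0.25, 0.58]), np.array([3, 4, 2, 5])),
        (np.array([0.22, 0.35, 0.44, 0.19]), np.array([2, 3, 4, 2])),
    ]

    # One sentence-level correlation per system, then summarized.
    rhos = np.array([spearmanr(m, h).correlation for m, h in systems])
    print("median", np.median(rhos),
          "mean", rhos.mean(),
          "trimmed", trim_mean(rhos, 0.1))  # 10% trim proportion is a guess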
Additional Experiments | For training, we used the non-UN portion of the NIST training corpora, which was segmented using an HMM segmenter (Lee et al., 2003). |
Experiments | For training we used the non-UN and non-HK Hansards portions of the NIST training corpora, which were segmented using the Stanford segmenter (Tseng et al., 2005).
Experiments | We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus. |
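Tuning here most likely means MERT-style optimization (Och, 2003), an exact line search over k-best lists; the toy random search below, over invented two-feature k-best data, only illustrates the underlying tune-to-BLEU objective and is not cdec's algorithm:

    import random
    from sacrebleu.metrics import BLEU

    # Toy k-best lists: (candidate, feature vector) per source sentence.
    # All data is invented; a real system gets these from the decoder.
    kbest = [
        [("the cat sat on the mat", [0.2, -1.0]), ("cat the sat mat on", [0.9, -0.1])],
        [("he went to the market", [0.1, -0.8]), ("he to market went", [0.7, -0.2])],
    ]
    refs = ["the cat sat on the mat", "he went to the market"]

    def rerank(weights):
        # Per sentence, pick the candidate with the highest weighted feature score.
        return [max(cands, key=lambda c: sum(w * f for w, f in zip(weights, c[1])))[0]
                for cands in kbest]

    bleu, best_w, best = BLEU(), None, -1.0
    random.seed(0)
    for _ in range(200):  # crude search over weight vectors
        w = [random.uniform(-1, 1) for _ in range(2)]
        score = bleu.corpus_score(rerank(w), [refs]).score
        if score > best:
            best_w, best = w, score
    print(best_w, best)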
Experiments | Here the training data consists of the non-UN and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8M and 9.4M words of Chinese and English, respectively.
Experiments | For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003 |
Experiments | We evaluate on the NIST test sets from 2003 and 2005, and the 2002 test set was used for MERT training. |
Abstract | As our approach combines the merits of phrase-based and string-to-dependency models, it achieves significant improvements over the two baselines on the NIST Chinese-English datasets. |
Introduction | We evaluate our method on the NIST Chinese-English translation datasets. |
Introduction | We used the 2002 NIST MT Chinese-English dataset as the development set and the 2003-2005 NIST datasets as the test sets.
Experiment Results | The language model is an interpolation of 5-gram language models built from the news corpora of the NIST 2012 evaluation.
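Interpolating several 5-gram models amounts to a convex combination of their probabilities, p(w | h) = Σ_i λ_i p_i(w | h), with the weights typically tuned to minimize held-out perplexity; a minimal sketch with toy stand-in models and invented weights:

    # Linear interpolation of n-gram LM probabilities. The component models
    # below are toy stand-ins for real 5-gram LMs; the weights are invented.
    def interpolate(models, lambdas, word, history):
        """models: callables p_i(word, history); lambdas: convex weights."""
        return sum(l * p(word, history) for l, p in zip(lambdas, models))

    p_news = lambda w, h: {"the": 0.5, "cat": 0.3}.get(w, 0.2)  # toy "news" LM
    p_web = lambda w, h: {"the": 0.4, "cat": 0.4}.get(w, 0.2)   # toy "web" LM

    print(interpolate([p_news, p_web], [0.7, 0.3], "cat", ("on", "the")))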
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences). |
Experiment Results | We tuned the parameters on the MT06 NIST test set (1664 sentences) and report results on the MT04, MT05 and MT08 unseen test sets.