Index of papers in Proc. ACL that mention
  • NIST
Chen, Boxing and Foster, George and Kuhn, Roland
Analysis and Discussion
testset (NIST): CE_LD ’06, ’08; CE_SD ’06, ’08
Analysis and Discussion
Table 4: Results (BLEU%) of Chinese-to-English large data (CE_LD) and small data (CE_SD) NIST task by applying one feature.
Analysis and Discussion
Table 6: Results (BLEU%) of using simple features based on context on small data NIST task.
Experiments
The first one is the large data condition, based on training data for the NIST 2009 evaluation Chinese-to-English track.
Experiments
We first created a development set which used mainly data from the NIST 2005 test set, and also some balanced-genre web-text from the NIST training material.
Experiments
Evaluation was performed on the NIST 2006 and 2008 test sets.
NIST is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Abstract
Experimental results on data sets for NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
Experiments
We conduct our experiments on the test data from the NIST 2005 and NIST 2008 Chinese-to-English machine translation tasks.
Experiments
The NIST 2003 test data is used for development data to estimate model parameters.
Experiments
In our experiments all the models are optimized with case-insensitive NIST version of BLEU score and we report results using this metric in percentage numbers.
Introduction
We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
NIST is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Voorhees, Ellen M.
The Three-way Decision Task
The answer key for the three-way decision task was developed at the National Institute of Standards and Technology (NIST) using annotators who had experience as TREC and DUC assessors.
The Three-way Decision Task
NIST assessors annotated all 800 entailment pairs in the test set, with each pair independently annotated by two different assessors.
The Three-way Decision Task
The three-way answer key was formed by keeping exactly the same set of YES answers as in the two-way key (regardless of the NIST annotations) and having NIST staff adjudicate assessor differences on the remainder.
NIST is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Abstract
On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.
Introduction
We show primary results on the NIST OpenMT12 Arabic-English condition.
Introduction
We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions.
Model Variations
For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr tokenizer for the NIST condition.
Model Variations
We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions.
Model Variations
6.1 NIST OpenMT12 Results
NIST is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Abstract
We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
Expt. 1: Predicting Absolute Scores
Our first experiment evaluates the models we have proposed on a corpus with traditional annotation on a seven-point scale, namely the NIST OpenMT 2008 corpus. The corpus contains translations of newswire text into English from three source languages (Arabic (Ar), Chinese (Ch), Urdu (Ur)).
Expt. 1: Predicting Absolute Scores
BLEUR, METEORR, and NISTR significantly predict one language each (all Arabic); TERR, MTR, and RTER predict two languages.
Experimental Evaluation
NISTR consists of 16 features.
Experimental Evaluation
NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
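For orientation, since several papers indexed here use the NIST metric itself, the score behind these 16 features is usually written as follows. This is the textbook form from Doddington (2002), paraphrased here rather than taken from the paper above; scorers differ in the maximum order N and in count smoothing.

```latex
% NIST score (standard form); information weights are estimated on the reference data
\mathrm{NIST} \;=\; \left( \sum_{n=1}^{N}
  \frac{\sum_{\text{co-occurring } w_1\ldots w_n} \mathrm{Info}(w_1\ldots w_n)}
       {\sum_{w_1\ldots w_n \,\in\, \text{hypothesis}} 1} \right) \cdot \mathrm{BP},
\qquad
\mathrm{Info}(w_1\ldots w_n) \;=\; \log_2 \frac{\mathrm{count}(w_1\ldots w_{n-1})}{\mathrm{count}(w_1\ldots w_n)},
\qquad
\mathrm{BP} \;=\; \exp\!\left(\beta \,\log^2 \min\!\left(\tfrac{L_{\mathrm{sys}}}{\bar{L}_{\mathrm{ref}}},\, 1\right)\right)
```

Here β is fixed so that BP = 0.5 when the hypothesis is two-thirds of the average reference length; the "NIST score divided by BP" feature above simply removes that last factor.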
Expt. 2: Predicting Pairwise Preferences
1: Among individual metrics, METEORR and TERR do better than BLEUR and NISTR.
Expt. 2: Predicting Pairwise Preferences
NISTR 50.2 70.4
Expt. 2: Predicting Pairwise Preferences
Again, we see better results for METEORR and TERR than for BLEUR and NISTR, and the individual metrics do worse than the combination models.
Introduction
Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
NIST is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Cui, Lei and Zhang, Dongdong and Liu, Shujie and Chen, Qiming and Li, Mu and Zhou, Ming and Yang, Muyun
Abstract
Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline.
Experiments
The NIST 2003 dataset is the development data.
Experiments
The testing data consists of NIST 2004, 2005, 2006 and 2008 datasets.
Experiments
NIST 2004
Introduction
We integrate topic similarity features in the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task.
NIST is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Abstract
Experimental results show that, our method can significantly improve machine translation performance on both IWSLT and NIST data, compared with a state-of-the-art baseline.
Conclusion and Future Work
We conduct experiments on IWSLT and NIST data, and our method can improve the performance significantly.
Experiments and Results
We test our method with two data settings: one is IWSLT data set, the other is NIST data set.
Experiments and Results
For the NIST data set, the bilingual training data we used is NIST 2008 training set excluding the Hong Kong Law and Hong Kong Hansard.
Experiments and Results
The baseline results on NIST data are shown in Table 2.
Introduction
We conduct experiments with IWSLT and NIST data, and experimental results show that, our method
NIST is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Discussion
Table 5: MBR Parameter Tuning on NIST systems
Experiments
The first one is the constrained data track of the NIST Arabic-to-English (aren) and Chinese-to-English (zhen) translation task.
Experiments
Table 1: Statistics over the NIST dev/test sets.
Experiments
Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters.
NIST is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Guzmán, Francisco and Joty, Shafiq and Màrquez, Llu'is and Nakov, Preslav
Experimental Results
Group III: contains other important evaluation metrics, which were not considered in the WMT12 metrics task: NIST and ROUGE for both system- and segment-level, and BLEU and TER at segment-level.
Experimental Results
NIST .817 .842 .875
Experimental Results
NIST .214 .172 .206; ROUGE .185 .144 .201
Experimental Setup
To complement the set of individual metrics that participated at the WMT12 metrics task, we also computed the scores of other commonly-used evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), TER (Snover et al., 2006), ROUGE-W (Lin, 2004), and three METEOR variants (Denkowski and Lavie, 2011): METEOR-ex (exact match), METEOR-st (+stemming) and METEOR-sy (+synonyms).
Experimental Setup
Combination of five metrics based on lexical similarity: BLEU, NIST, METEOR-ex, ROUGE-W, and TERp-A.
Related Work
The field of automatic evaluation metrics for MT is very active, and new metrics are continuously being proposed, especially in the context of the evaluation campaigns that run as part of the Workshops on Statistical Machine Translation (WMT 2008-2012), and NIST Metrics for Machine Translation Challenge (MetricsMATR), among others.
NIST is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Vaswani, Ashish and Huang, Liang and Chiang, David
Experiments
To demonstrate the effect of the ℓ0-norm on the IBM models, we performed experiments on four translation tasks: Arabic-English, Chinese-English, and Urdu-English from the NIST Open MT Evaluation, and the Czech-English translation from the Workshop on Machine Translation (WMT) shared task.
Experiments
• Chinese-English: selected data from the constrained task of the NIST 2009 Open MT Evaluation.
Experiments
• Arabic-English: all available data for the constrained track of NIST 2009, excluding United Nations proceedings (LDC2004E13), ISI Automatically Extracted Parallel Text (LDC2007E08), and Ummah newswire text (LDC2004T18), for a total of 5.4+4.3 million words.
NIST is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Abstract
The experimental results on three NIST evaluation test sets show that our method leads to significant improvements in translation accuracy over the baseline systems.
Background
In this paper, we use the NIST definition of BLEU where the effective reference length is the length of the shortest reference translation.
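Since the definition above turns only on how the effective reference length is chosen, a minimal sketch may help; this is illustrative Python, not code from the paper, and the function name and example lengths are invented:

```python
import math

def brevity_penalty(hyp_len, ref_lens):
    """BLEU brevity penalty with the effective reference length set to the
    shortest reference, as in the NIST-style definition mentioned above."""
    r = min(ref_lens)   # effective reference length = shortest reference
    c = hyp_len
    return 1.0 if c >= r else math.exp(1.0 - r / c)

# A 9-word hypothesis against references of 10, 12 and 15 words:
# the shortest reference length is 10, so the penalty is exp(1 - 10/9) ≈ 0.895.
print(brevity_penalty(9, [10, 12, 15]))
```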
Background
The data set used for weight training in boosting-based system combination comes from NIST MT03 evaluation set.
Background
The test sets are the NIST evaluation sets of MT04, MT05 and MT06.
Introduction
All the systems are evaluated on three NIST MT evaluation test sets.
NIST is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Metric Design Considerations
To evaluate our metric, we conduct experiments on datasets from the ACL-07 MT workshop and NIST
Metric Design Considerations
Table 4: Correlations on the NIST MT 2003 dataset.
Metric Design Considerations
5.2 NIST MT 2003 Dataset
NIST is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Duh, Kevin and Sudoh, Katsuhito and Wu, Xianchao and Tsukada, Hajime and Nagata, Masaaki
Experiments
(2) The NIST task is Chinese-to-English translation with OpenMT08 training data and MT06 as devset.
Experiments
Train / Devset / #Feat / Metrics: PubMed 0.2M / 2k / 14 / BLEU, RIBES; NIST 7M / 1.6k / 8 / BLEU, NTER
Experiments
Our MT models are trained with standard phrase-based Moses software (Koehn and others, 2007), with IBM M4 alignments, 4-gram SRILM, lexical ordering for PubMed and distance ordering for the NIST system.
Introduction
Experiments on NIST Chinese-English and PubMed English-Japanese translation using BLEU, TER, and RIBES are presented in Section 4.
NIST is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ling, Wang and Xiang, Guang and Dyer, Chris and Black, Alan and Trancoso, Isabel
Experiments
We chose to use this data set, rather than more standard NIST test sets to ensure that we had recent documents in the test set (the most recent NIST test sets contain documents published in 2007, well before our microblog data was created).
Experiments
For this test set, we used 8 million sentences from the full NIST parallel dataset as the language model training data.
Experiments
FBIS 9.4, 18.6, 10.4, 12.3; NIST 11.5, 21.2, 11.4, 13.9; Weibo 8.75, 15.9, 15.7, 17.2
Parallel Data Extraction
Likewise, for the EN-AR language pair, we use a fraction of the NIST dataset, by removing the data originated from UN, which leads to approximately 1M sentence pairs.
NIST is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Alternatives to Correlation-based Meta-evaluation
NIST 5.70; randOST 5.20; minOST 3.67
Correlation with Human Judgements
NIST.
Metrics and Test Beds
At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Metrics and Test Beds
Table 1: NIST 2004/2005 MT Evaluation Campaigns.
Metrics and Test Beds
We use the test beds from the 2004 and 2005 NIST MT Evaluation Campaigns (Le and Przybocki, 2005).
Previous Work on Machine Translation Meta-Evaluation
With the aim of overcoming some of the deficiencies of BLEU, Doddington (2002) introduced the NIST metric.
Previous Work on Machine Translation Meta-Evaluation
Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
NIST is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Yarowsky, David
Abstract
We integrate our method into a state-of-the-art baseline translation system and show that it consistently improves the performance of the baseline system on various NIST MT test sets.
Conclusions
We integrate our method into a state-of-the-art phrase-based baseline translation system, i.e., Moses (Koehn et al., 2007), and show that the integrated system consistently improves the performance of the baseline system on various NIST machine translation test sets.
Experimental Results
We compile a parallel dataset which consists of various corpora distributed by the Linguistic Data Consortium (LDC) for NIST MT evaluation.
Experimental Results
4.5.2 BLEU on NIST MT Test Sets
Experimental Results
Table 7 reports the results on various NIST MT test sets.
Introduction
We carry out experiments on a state-of-the-art SMT system, i.e., Moses (Koehn et al., 2007), and show that the abbreviation translations consistently improve the translation performance (in terms of BLEU (Papineni et al., 2002)) on various NIST MT test sets.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Duan, Xiangyu and Zhang, Min and Li, Haizhou
Experiments and Results
Additionally, NIST score (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) are also used to check the consistency of experimental results.
Experiments and Results
BLEU 0.4029, 0.3146; NIST 7.0419, 8.8462; METEOR 0.5785, 0.5335
Experiments and Results
Both SMP and ESSP outperform baseline consistently in BLEU, NIST and METEOR.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Pitler, Emily and Louis, Annie and Nenkova, Ani
Abstract
We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results.
Conclusion
Automatic evaluation will make testing easier during system development and enable reporting results obtained outside of the cycles of NIST evaluation.
Introduction
quality and none have been validated on data from NIST evaluations.
Introduction
We evaluate the predictive power of these linguistic quality metrics by training and testing models on consecutive years of NIST evaluations (data described
Results and discussion
In both DUC 2006 and DUC 2007, ten NIST assessors wrote summaries for the various inputs.
Results and discussion
We only report results on the input level, as we are interested in distinguishing between the quality of the summaries, not the NIST assessors’ writing skills.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Experimental Setup
We trained the system on the NIST MT06 Eval corpus excluding the UN data (approximately 900K sentence pairs).
Experimental Setup
We used the NIST MT03 test set as the development set for optimizing interpolation weights using minimum error rate training (MERT; (Och and Ney, 2002)).
Experimental Setup
We carried out evaluation of the systems on the NIST 2006 evaluation test (MT06) and the NIST 2008 evaluation test (MT08).
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Kuhn, Jonas
Experiments
NIST, sentence-level n-gram overlap weighted in favour of less frequent n-grams, as in (Belz et al., 2011)
Experiments
score for the REG→LIN system comes close to the upper bound that applies linearization on linSynJflae, gold shallow trees with gold REs (BLEUT of 72.4), whereas the difference in standard BLEU and NIST is high.
Experiments
Input System BLEU NIST BLEUT
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min
Conclusion
• The sense-based translation model is able to substantially improve translation quality in terms of both BLEU and NIST.
Experiments
We used the NIST MT03 evaluation test data as our development set, and the NIST MT05 as the test set.
Experiments
We evaluated translation quality with the case-insensitive BLEU-4 (Papineni et al., 2002) and NIST (Doddington, 2002).
Experiments
System, BLEU(%), NIST: STM (i5w) 34.64, 9.4346; STM (i10w) 34.76, 9.5114; STM (i15w) -, -
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Hewavitharana, Sanjika and Mehay, Dennis and Ananthakrishnan, Sankaranarayanan and Natarajan, Prem
Abstract
On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST .
Experimental Setup and Results
Table 1 summarizes test set performance in BLEU (Papineni et al., 2001), NIST (Doddington, 2002) and TER (Snover et al., 2006).
Experimental Setup and Results
In the ASR setting, which simulates a real-world deployment scenario, this system achieves improvements of 0.39 (BLEU), -0.6 (TER) and 0.08 (NIST).
Incremental Topic-Based Adaptation
REFERENCE TRANSCRIPTIONS: SYSTEM | BLEU↑ | TER↓ | NIST↑
Incremental Topic-Based Adaptation
SYSTEM | BLEU↑ | TER↓ | NIST↑
Introduction
With this approach, we demonstrate significant improvements over a baseline phrase-based SMT system as measured by BLEU, TER and NIST scores on an English-to-Iraqi CSLT task.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Foster, George
Abstract
Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
Experiments
We carried out experiments in two different settings, both involving data from NIST Open MT 2012. The first setting is based on data from the Chinese to English constrained track, comprising about 283 million English running words.
Experiments
The development set (tune) was taken from the NIST 2005 evaluation set, augmented with some web-genre material reserved from other NIST corpora.
Experiments
Table 2: NIST Arabic-English data.
Vector space model adaptation
Table 1: NIST Chinese-English data.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Lu, Shixiang and Chen, Zhenbiao and Xu, Bo
Abstract
On two Chinese-English tasks, our semi-supervised DAE features obtain statistically significant improvements of 1.34/2.45 (IWSLT) and 0.82/1.52 (NIST) BLEU points over the unsupervised DBN features and the baseline features, respectively.
Conclusions
The results also demonstrate that DNN (DAE and HCDAE) features are complementary to the original features for SMT, and adding them together obtain statistically significant improvements of 3.16 (IWSLT) and 2.06 (NIST) BLEU points over the baseline features.
Experiments and Results
NIST.
Experiments and Results
Our development set is NIST 2005 MT evaluation set (1084 sentences), and our test set is NIST 2006 MT evaluation set (1664 sentences).
Experiments and Results
Adding new DNN features as extra features significantly improves translation accuracy (row 2-17 vs. 1), with the highest increase of 2.45 (IWSLT) and 1.52 (NIST) (row 14 vs. 1) BLEU points over the baseline features.
Introduction
Finally, we conduct large-scale experiments on IWSLT and NIST Chinese-English translation tasks, respectively, and the results demonstrate that our solutions solve the two aforementioned shortcomings successfully.
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Boyd-Graber, Jordan and Resnik, Philip
Experiments
The second setting uses the non-UN and non-HK Hansards portions of the NIST training corpora with LTM only.
Experiments
En, Zh: FBIS 269K, 10.3M, 7.9M; NIST 1.6M, 44.4M, 40.4M
Experiments
2010) as our decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 tuning corpus using the Margin Infused Relaxed Algorithm (MIRA) (Crammer et al., 2006; Eidelman, 2012).
NIST is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang
Experiments
We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
Experiments
The NIST evaluation campaign data, MT-03 and MT-05, are selected to comprise the MT development data, devMT, and testing data, testMT, respectively.
Experiments
NIST and METEOR over others.
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Hu, Yuening and Zhai, Ke and Eidelman, Vladimir and Boyd-Graber, Jordan
Experiments
Dataset and SMT Pipeline We use the NIST MT Chinese-English parallel corpus (NIST), excluding non-UN and non-HK Hansards portions as our training dataset.
Experiments
To optimize SMT system, we tune the parameters on NIST MT06, and report results on three test sets: MT02, MT03 and MT05.
Experiments
Resources for Prior Tree To build the tree for tLDA and ptLDA, we extract the word correlations from a Chinese-English bilingual dictionary (Denisowski, 1997). We filter the dictionary using the NIST vocabulary, and keep entries mapping single Chinese and single English words.
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Sun, Hong and Zhou, Ming
Abstract
Our experiments on NIST 2008 testing data with automatic evaluation as well as human judgments suggest that the proposed method is able to enhance the paraphrase quality by adjusting between semantic equivalency and surface dissimilarity.
Experiments and Results
We use 2003 NIST Open Machine Translation Evaluation data (NIST 2003) as development data (containing 919 sentences) for MERT and test the performance on NIST 2008 data set (containing 1357 sentences).
Experiments and Results
NIST Chinese-to-English evaluation data offers four English human translations for every Chinese sentence.
Experiments and Results
Table 1: iBLEU Score Results (NIST 2008)
Introduction
We test our method on NIST 2008 testing data.
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Shen, Libin and Xu, Jinxi and Weischedel, Ralph
Conclusions and Future Work
Our string-to-dependency system generates 80% fewer rules, and achieves 1.48 point improvement in BLEU and 2.53 point improvement in TER on the decoding output on the NIST 04 Chinese-English evaluation set.
Experiments
We used part of the NIST 2006 Chinese-English large track data as well as some LDC corpora collected for the DARPA GALE program (LDC2005E83, LDC2006E34 and LDC2006G05) as our bilingual training data.
Experiments
We tuned the weights on NIST MT05 and tested on MT04.
Introduction
For example, Chiang (2007) showed that the Hiero system achieved about 1 to 3 point improvement in BLEU on the NIST 03/04/05 Chinese-English evaluation sets compared to a state-of-the-art phrasal system.
Introduction
Our string-to-dependency decoder shows 1.48 point improvement in BLEU and 2.53 point improvement in TER on the NIST 04 Chinese-English MT evaluation set.
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Galley, Michel and Manning, Christopher D.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Introduction
petitive phrase-based systems in large-scale experiments such as NIST evaluations. This lack of significant difference may not be completely surprising.
Introduction
Results of the 2008 NIST Open MT evaluation (http://www.itl.nist.gov/iad/mig/tests/mt/2008/doc/mt08_official_results_v0.html) reveal that, while many of the best systems in the Chinese-English and Arabic-English tasks incorporate synchronous CFG models, score differences with the best phrase-based system were insignificantly small.
Machine translation experiments
For tuning and testing, we use the official NIST MT evaluation data for Chinese from 2002 to 2008 (MT02 to MT08), which all have four English references for each input sentence.
Machine translation experiments
Table 6 provides experimental results on the NIST test data (excluding the tuning set MT05) for each of the three genres: newswire, web data, and speech (broadcast news and conversation).
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Liu, Chang and Ng, Hwee Tou
Experiments
Table 1: Inter-judge Kappa for the NIST 2008 English—Chinese task
Experiments
4.2 NIST 2008 English-Chinese MT Task
Experiments
The NIST 2008 English-Chinese MT task consists of 127 documents with 1,830 segments, each with four reference translations and eleven automatic MT system translations.
Introduction
The work compared various MT evaluation metrics (BLEU, NIST, METEOR, GTM, 1-TER) with different segmentation schemes, and found that treating every single character as a token (character-level MT evaluation) gives the best correlation with human judgments.
NIST is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Setiawan, Hendra and Zhou, Bowen and Xiang, Bing and Shen, Libin
Abstract
On NIST MT08 set, our most advanced model brings around +2.0 BLEU and -1.0 TER improvement.
Experiments
As for the blind test set, we report the performance on the NIST MT08 evaluation set, which consists of 691 sentences from newswire and 666 sentences from weblog.
Experiments
Table 4 summarizes the experimental results on NIST MT08 newswire and weblog.
Experiments
Table 4: The NIST MT08 results on newswire (nw) and weblog (wb) genres.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Avramidis, Eleftherios and Koehn, Philipp
Experiments
Results were evaluated with both BLEU (Papineni et al., 2001) and NIST metrics (NIST, 2002).
Experiments
BLEU (devtest, test07) and NIST (devtest, test07): baseline 18.13, 18.05, 5.218, 5.279; person 18.16, 18.17, 5.224, 5.316
Experiments
The NIST metric clearly shows a significant improvement, because it mostly measures difficult n-gram matches (e.g., due to the long-distance rules we have been dealing with).
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
Abstract
We apply our approach to a state-of-the-art phrase-based system and demonstrate very promising BLEU improvements and TER reductions on the NIST Chinese-English MT evaluation data.
Conclusion and Future Work
The experimental results show that the proposed approach achieves very promising BLEU improvements and TER reductions on the NIST evaluation data.
Evaluation
We used the newswire portion of the NIST MT06 evaluation data as our development set, and used the evaluation data of MT04 and MT05 as our test sets.
Introduction
• We apply the proposed model to Chinese-English phrase-based MT and demonstrate promising BLEU improvements and TER reductions on the NIST evaluation data.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Experiments
We run an improved version of our 2006 NIST MT Evaluation entry for the Arabic-English “Unlimited” data track. The language model is the same one as in the previous section.
Experiments
We use MT04 data for system development, with MT05 data and MT06 (“NIST” subset) data for blind testing.
Experiments
Overall, our baseline results compare favorably to those reported on the NIST MT06 web site.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhang, Min and Jiang, Hongfei and Aw, Aiti and Li, Haizhou and Tan, Chew Lim and Li, Sheng
Abstract
Experimental results on the NIST MT-2005 Chinese-English translation task show that our method statistically significantly outperforms the baseline systems.
Conclusions and Future Work
The experimental results on the NIST MT-2005 Chinese-English translation task demonstrate the effectiveness of the proposed model.
Experiments
We used sentences with less than 50 characters from the NIST MT-2002 test set as our development set and the NIST MT-2005 test set as our test set.
Introduction
Experiment results on the NIST MT-2005 Chinese-English translation task show that our method significantly outperforms Moses (Koehn et al., 2007), a state-of-the-art phrase-based SMT system, and other linguistically syntax-based methods, such as SCFG-based and STSG-based methods (Zhang et al., 2007).
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Abstract
Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems.
Conclusion
Finally, we examine our methods on the FBIS corpus and the NIST MT-2003 Chinese-English translation task.
Experiment
We use the FBIS corpus as training set, the NIST MT-2002 test set as development (dev) set and the NIST MT-2003 test set as test set.
Introduction
We evaluate our method on the NIST MT-2003 Chinese-English translation tasks.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Cai, Jingsheng and Utiyama, Masao and Sumita, Eiichiro and Zhang, Yujie
Abstract
We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data.
Experiments
Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentence pairs.
Experiments
Our test set was the NIST 2006 MT evaluation data, consisting of 1664 sentence pairs.
Introduction
Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiong, Deyi and Zhang, Min and Li, Haizhou
Experiments
For the error detection task, we use the best translation hypotheses of NIST MT-02/05/03 generated by MOSES as our training, development, and test corpus respectively.
SMT System
The translation task is on the official NIST Chinese-to-English evaluation data.
SMT System
For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task.
SMT System
In order to calculate word posterior probabilities, we generate 10,000 best lists for NIST MT-02/03/05 respectively.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Visweswariah, Karthik and Khapra, Mitesh M. and Ramanathan, Ananthakrishnan
Abstract
The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.
Experimental setup
We use about 10K sentences (180K words) of manual word alignments which were created in house using part of the NIST MT-08 training data to train our baseline reordering model and to train our supervised machine aligners.
Experimental setup
We use a parallel corpus of 3.9M words consisting of 1.7M words from the NIST MT-08 training data set and 2.2M words extracted from parallel news stories on the
Experimental setup
We report results on the (four reference) NIST MT-08 evaluation set in Table 4 for the News and Web conditions.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
Experiments
The dev set comprised mainly data from the NIST 2005 test set, and also some balanced-genre web-text from NIST.
Experiments
Evaluation was performed on NIST 2006 and 2008.
Experiments
Table 10: Ordering scores (p, I and v) for test sets NIST
Introduction
• BLEU (Papineni et al., 2002), NIST (Doddington, 2002), WER, PER, TER (Snover et al., 2006), and LRscore (Birch and Osborne, 2011) do not use external linguistic
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Li, Junhui and Tu, Zhaopeng and Zhou, Guodong and van Genabith, Josef
Abstract
Experiments on Chinese-English translation on four NIST MT test sets show that the HD-HPB model significantly outperforms Chiang’s model with average gains of 1.91 points absolute in BLEU.
Experiments
We train our model on a dataset with ~1.5M sentence pairs from the LDC dataset. We use the 2002 NIST MT evaluation test data (878 sentence pairs) as the development data, and the 2003, 2004, 2005, 2006-news NIST MT evaluation test data (919, 1788, 1082, and 616 sentence pairs, respectively) as the test data.
Experiments
For evaluation, the NIST BLEU script (version 12) with the default settings is used to calculate the BLEU scores.
Introduction
Experiments on Chinese-English translation using four NIST MT test sets show that our HD-HPB model significantly outperforms Chiang’s HPB as well as a SAMT-style refined version of HPB.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Xiao, Xinyan and Xiong, Deyi and Zhang, Min and Liu, Qun and Lin, Shouxun
Abstract
We show that our model significantly improves the translation performance over the baseline on NIST Chinese-to-English translation experiments.
Experiments
We present our experiments on the NIST Chinese-English translation tasks.
Experiments
We used the NIST evaluation set of 2005 (MT05) as our development set, and sets of MT06/MT08 as test sets.
Experiments
Case-insensitive NIST BLEU (Papineni et al., 2002) was used to mea-
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and Wang, Sida and Cer, Daniel and Manning, Christopher D.
Analysis
The mass is concentrated along the diagonal, probably because MT05/6/8 was prepared by NIST, an American agency, while the bitext was collected from many sources including Agence France Presse.
Experiments
4.3 NIST OpenMT Experiment
Experiments
However, the bitext5k models do not generalize as well to the NIST evaluation sets as represented by the MT04 result.
Introduction
The first experiment uses standard tuning and test sets from the NIST OpenMT competitions.
NIST is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Fang, Licheng and Xu, Peng and Wu, Xiaoyun
Abstract
Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system).
Experiments
For English-to-Chinese translation, we used all the allowed training sets in the NIST 2008 constrained track.
Experiments
For NIST , we filtered out sentences exceeding 80 words in the parallel texts.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Feng, Yang and Cohn, Trevor
Experiments
Here the training data consists of the non-UN portions and non-HK Hansards portions of the NIST training corpora distributed by the LDC, totalling 303k sentence pairs with 8m and 9.4m words of Chinese and English, respectively.
Experiments
For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003
Experiments
We evaluate on the NIST test sets from 2003 and 2005, and the 2002 test set was used for MERT training.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Eidelman, Vladimir and Marton, Yuval and Resnik, Philip
Additional Experiments
For training, we used the non-UN portion of the NIST training corpora, which was segmented using an HMM segmenter (Lee et al., 2003).
Experiments
For training we used the non-UN and non-HK Hansards portions of the NIST training corpora, which was segmented using the Stanford segmenter (Tseng et al., 2005).
Experiments
We used cdec (Dyer et al., 2010) as our hierarchical phrase-based decoder, and tuned the parameters of the system to optimize BLEU (Papineni et al., 2002) on the NIST MT06 corpus.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Experiments
In addition to BLEU score, percentage of exactly matched sentences and average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
The average NIST simple string accuracy score reflects the average number of insertion (I), deletion (D), and substitution (S) errors between the output sentence and the reference sentence.
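One common way to turn that description into a number is 1 minus the token-level edit-distance error rate relative to the reference length. The sketch below is illustrative only; the function name is invented and the paper's own scoring script may differ in details such as flooring at zero:

```python
def simple_string_accuracy(hyp, ref):
    """One common formulation: 1 - (I + D + S) / reference length,
    with I, D, S taken from a token-level Levenshtein alignment.
    Illustrative sketch, not the scoring script used in the paper."""
    h, r = hyp.split(), ref.split()
    # standard dynamic-programming edit distance over tokens
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - d[len(h)][len(r)] / len(r)

# One word missing relative to a 6-token reference: 1 - 1/6 ≈ 0.83
print(simple_string_accuracy("the cat sat on mat", "the cat sat on the mat"))
```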
Log-linear Models
The BLEU scoring script is supplied by NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Liu, Yang
Abstract
As our approach combines the merits of phrase-based and string-to-dependency models, it achieves significant improvements over the two baselines on the NIST Chinese-English datasets.
Introduction
We evaluate our method on the NIST Chinese-English translation datasets.
Introduction
We used the 2002 NIST MT Chinese-English dataset as the development set and the 2003-2005 NIST datasets as the testsets.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Braslavski, Pavel and Beloborodov, Alexander and Khalilov, Maxim and Sharoff, Serge
Evaluation methodology
In addition to human evaluation, we also ran system-level automatic evaluations using BLEU (Papineni et al., 2001), NIST (Doddington, 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2009), and GTM (Turian et al., 2003).
Results
The lower part of Table 2 also reports the results of simulated dynamic ranking (using the NIST rankings as the initial order for the sort operation).
Results
Metric: sentence level (median, mean, trimmed), corpus level. BLEU 0.357, 0.298, 0.348, 0.833; NIST 0.357, 0.291, 0.347, 0.810; Meteor 0.429, 0.348, 0.393, 0.714; TER 0.214, 0.186, 0.204, 0.619; GTM 0.429, 0.340, 0.392, 0.714
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Cahill, Aoife and Kuhn, Jonas
Experimental Setup
NIST 13.01, 12.95, 12.69; Match 27.91, 27.66, 26.38; Ling.
Experimental Setup
Model: BLEU 0.764, 0.759, 0.747; NIST 13.18, 13.14, 13.01
Experimental Setup
use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Experimental Setup
We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data.
Experimental Setup
We tune on the NIST 2004 evaluation set (1353 sentences) and evaluate on NIST 2005 (1056 sentences).
Results
Judging from the output on the NIST 2005 test set, the system uses these discontiguous desegmentations very rarely: only 5% of desegmented tokens align to discontiguous source phrases.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Abstract
Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on NTCIR PatentMT corpus which is out-of-domain.
Complexity Analysis
The first bilingual corpus: OpenMT06 was used in the NIST open machine translation 2006 Evaluation.
Complexity Analysis
The data sets of NIST Eval 2002 to 2005 were used as the development for MERT tuning (Och, 2003).
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wintrode, Jonathan and Khudanpur, Sanjeev
Introduction
The BABEL task is modeled on the 2006 NIST Spoken Term Detection evaluation (NIST, 2006) but focuses on limited resource conditions.
Results
At our disposal, we have the five BABEL languages — Tagalog, Cantonese, Pashto, Turkish and Vietnamese — as well as the development data from the NIST 2006 English evaluation.
Term Detection Re-scoring
The primary metric for the BABEL program, Actual Term Weighted Value (ATWV) is defined by NIST using a cost function of the false alarm probability P(FA) and P(Miss), averaged over a set of queries (NIST, 2006).
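For context, the term-weighted value underlying ATWV is usually written as follows; this paraphrases the NIST 2006 STD evaluation plan rather than the paper above, and the numeric value of β reflects the default cost/value ratio and term prior used there:

```latex
% Term-Weighted Value at decision threshold theta, averaged over the query set Q
\mathrm{TWV}(\theta) \;=\; 1 \;-\; \frac{1}{|Q|} \sum_{q \in Q}
  \Big[ P_{\mathrm{Miss}}(q,\theta) \;+\; \beta \, P_{\mathrm{FA}}(q,\theta) \Big],
\qquad
\beta \;=\; \frac{C}{V}\left(\mathrm{Pr}_{\mathrm{term}}^{-1} - 1\right) \;\approx\; 999.9
```

ATWV is this quantity evaluated at the system's actual detection threshold, so the threshold choice itself is part of what gets scored.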
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wu, Xianchao and Sudoh, Katsuhito and Duh, Kevin and Tsukada, Hajime and Nagata, Masaaki
Experiments
For Chinese-to-English translation, we use the parallel data from NIST Open Machine Translation Evaluation tasks.
Experiments
The NIST 2003 and 2005 test data are respectively taken as the development and test set.
Introduction
(2008) and achieved state-of-the-art results as reported in the NIST 2008 Open MT Evaluation workshop and the NTCIR-9 Chinese-to-English patent translation task (Goto et al., 2011; Ma and Matsoukas, 2011).
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wu, Hua and Wang, Haifeng and Liu, Ting
Experiments
develop NIST 2002 878 10 NIST 2005 1,082 4 NIST 2004 1,788 5 test NIST 2006 1,664 4 NIST 2008 1,357 4
Experiments
The system was tested using the Chinese-English MT evaluation sets of NIST 2004, NIST 2006 and NIST 2008.
Experiments
For development, we used the Chinese-English MT evaluation sets of NIST 2002 and NIST 2005.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Nguyen, ThuyLinh and Vogel, Stephan
Experiment Results
The language model is the interpolation of 5-gram language models built from news corpora of the NIST 2012 evaluation.
Experiment Results
We tuned the parameters on the MT06 NIST test set (1664 sentences) and report the BLEU scores on three unseen test sets: MT04 (1353 sentences), MT05 (1056 sentences) and MT09 (1313 sentences).
Experiment Results
We tuned the parameters on MT06 NIST test set of 1664 sentences and report the results of MT04, MT05 and MT08 unseen test sets.
NIST is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: