Index of papers in Proc. ACL 2009 that mention
  • NIST
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Abstract
Experimental results on data sets for the NIST Chinese-to-English machine translation task show that the co-decoding method can bring significant improvements to all baseline decoders, and the outputs from co-decoding can be used to further improve the result of system combination.
Experiments
We conduct our experiments on the test data from the NIST 2005 and NIST 2008 Chinese-to-English machine translation tasks.
Experiments
The NIST 2003 test data is used for development data to estimate model parameters.
Experiments
In our experiments, all models are optimized with the case-insensitive NIST version of the BLEU score, and we report results using this metric as percentages.
Introduction
We will present experimental results on the data sets of NIST Chinese-to-English machine translation task, and demonstrate that co-decoding can bring significant improvements to baseline systems.
NIST is mentioned in 16 sentences in this paper.
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Abstract
We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings.
Expt. 1: Predicting Absolute Scores
Our first experiment evaluates the models we have proposed on a corpus with traditional annotation on a seven-point scale, namely the NIST OpenMT 2008 corpus. The corpus contains translations of newswire text into English from three source languages (Arabic (Ar), Chinese (Ch), Urdu (Ur)).
Expt. 1: Predicting Absolute Scores
BLEUR, METEORR, and NISTR significantly predict one language each (all Arabic); TERR, MTR, and RTER predict two languages.
Experimental Evaluation
NISTR consists of 16 features.
Experimental Evaluation
NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
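These 16 features build on the quantities defined by Doddington (2002). For reference, here is a reconstruction of the standard definitions (the notation is ours, not text from the indexed paper): the information weight of an n-gram is estimated from reference n-gram counts, and the exponential factor is the NIST brevity penalty (BP):

```latex
% Reconstruction of the standard NIST definitions (Doddington, 2002); notation is ours.
\[
\mathrm{Info}(w_1 \dots w_n) = \log_2 \frac{\#(w_1 \dots w_{n-1})}{\#(w_1 \dots w_n)}
\]
\[
\mathrm{NIST} = \sum_{n=1}^{N}
  \frac{\sum_{\text{co-occurring } w_1 \dots w_n} \mathrm{Info}(w_1 \dots w_n)}
       {\sum_{w_1 \dots w_n \,\in\, \text{hyp}} 1}
  \cdot \underbrace{\exp\!\left\{ \beta \, \log^2 \min\!\left(\tfrac{L_{\text{hyp}}}{\bar{L}_{\text{ref}}},\, 1\right) \right\}}_{\text{brevity penalty (BP)}}
\]
```

The NIST-n features above correspond to capping the outer sum at n-gram order n; β is conventionally set so that BP = 0.5 when the hypothesis is two thirds of the average reference length.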
Expt. 2: Predicting Pairwise Preferences
1: Among individual metrics, METEORR and TERR do better than BLEUR and NISTR.
Expt. 2: Predicting Pairwise Preferences
NISTR 50.2 70.4
Expt. 2: Predicting Pairwise Preferences
Again, we see better results for METEORR and TERR than for BLEUR and NISTR, and the individual metrics do worse than the combination models.
Introduction
Since human evaluation is costly and difficult to do reliably, a major focus of research has been on automatic measures of MT quality, pioneered by BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
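To make the n-gram overlap idea concrete, here is a minimal sketch of reference-clipped n-gram precision, the quantity that BLEU-style metrics aggregate over several n-gram orders. It is illustrative only; the function names are ours and nothing here is taken from the indexed papers.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis, references, n):
    """Clipped n-gram precision of a hypothesis against one or more references.

    Each hypothesis n-gram counts as matched at most as often as it appears
    in the most generous reference (the 'clipping' used by BLEU-style metrics).
    """
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts:
        return 0.0
    # Maximum reference count for every n-gram, taken over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())

# Example: unigram and bigram precision against a single reference.
hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(ngram_precision(hyp, [ref], 1), ngram_precision(hyp, [ref], 2))  # 5/6 and 3/5
```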
NIST is mentioned in 10 sentences in this paper.
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Discussion
Table 5: MBR Parameter Tuning on NIST systems
Experiments
The first one is the constrained data track of the NIST Arabic-to-English (aren) and Chinese-to-English (zhen) translation tasks.
Experiments
Table 1: Statistics over the NIST dev/test sets.
Experiments
Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters.
NIST is mentioned in 9 sentences in this paper.
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Alternatives to Correlation-based Meta-evaluation
NIST 5.70    randOST 5.20    minOST 3.67
Correlation with Human Judgements
NIST.
Metrics and Test Beds
At the lexical level, we have included several standard metrics, based on different similarity assumptions: edit distance (WER, PER and TER), lexical precision (BLEU and NIST), lexical recall (ROUGE), and F-measure (GTM and METEOR).
Metrics and Test Beds
Table 1: NIST 2004/2005 MT Evaluation Campaigns.
Metrics and Test Beds
We use the test beds from the 2004 and 2005 NIST MT Evaluation Campaigns (Le and Przybocki, 2005).
Previous Work on Machine Translation Meta-Evaluation
With the aim of overcoming some of the deficiencies of BLEU, Doddington (2002) introduced the NIST metric.
Previous Work on Machine Translation Meta-Evaluation
Lin and Och (2004) experimented, unlike previous works, with a wide set of metrics, including NIST, WER (Nießen et al., 2000), PER (Tillmann et al., 1997), and variants of ROUGE, BLEU and GTM.
NIST is mentioned in 7 sentences in this paper.
Setiawan, Hendra and Kan, Min Yen and Li, Haizhou and Resnik, Philip
Experimental Setup
We trained the system on the NIST MT06 Eval corpus excluding the UN data (approximately 900K sentence pairs).
Experimental Setup
We used the NIST MT03 test set as the development set for optimizing interpolation weights using minimum error rate training (MERT; Och and Ney, 2002).
Experimental Setup
We carried out evaluation of the systems on the NIST 2006 evaluation test (MT06) and the NIST 2008 evaluation test (MT08).
NIST is mentioned in 6 sentences in this paper.
Galley, Michel and Manning, Christopher D.
Abstract
Our results show that augmenting a state-of-the-art phrase-based system with this dependency language model leads to significant improvements in TER (0.92%) and BLEU (0.45%) scores on five NIST Chinese-English evaluation test sets.
Introduction
[...] competitive phrase-based systems in large-scale experiments such as NIST evaluations. This lack of significant difference may not be completely surprising.
Introduction
Results of the 2008 NIST Open MT evaluation (http://www.itl.nist.gov/iad/mig/tests/mt/2008/doc/mt08_official_results_v0.html) reveal that, while many of the best systems in the Chinese-English and Arabic-English tasks incorporate synchronous CFG models, score differences with the best phrase-based system were insignificantly small.
Machine translation experiments
For tuning and testing, we use the official NIST MT evaluation data for Chinese from 2002 to 2008 (MT02 to MT08), which all have four English references for each input sentence.
Machine translation experiments
Table 6 provides experimental results on the NIST test data (excluding the tuning set MT05) for each of the three genres: newswire, web data, and speech (broadcast news and conversation).
NIST is mentioned in 5 sentences in this paper.
Zhang, Hui and Zhang, Min and Li, Haizhou and Aw, Aiti and Tan, Chew Lim
Abstract
Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems.
Conclusion
Finally, we examine our methods on the FBIS corpus and the NIST MT-2003 Chinese-English translation task.
Experiment
We use the FBIS corpus as training set, the NIST MT-2002 test set as development (dev) set and the NIST MT-2003 test set as test set.
Introduction
We evaluate our method on the NIST MT-2003 Chinese-English translation task.
NIST is mentioned in 4 sentences in this paper.
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Experiments
In addition to the BLEU score, the percentage of exactly matched sentences and the average NIST simple string accuracy (SSA) are adopted as evaluation metrics.
Experiments
The average NIST simple string accuracy score reflects the average number of insertion (I), deletion (D), and substitution (S) errors between the output sentence and the reference sentence.
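A minimal sketch of that computation, assuming the common definition SSA = 1 − (I + D + S) / reference length (the exact averaging and normalization used in the paper are not shown in this excerpt, and the function names below are ours):

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance: minimum number of insertions,
    deletions, and substitutions turning hyp into ref."""
    # dp[i][j] = edit distance between hyp[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(hyp)][len(ref)]

def simple_string_accuracy(hyp_sentence, ref_sentence):
    """SSA = 1 - (I + D + S) / reference length, floored at zero."""
    hyp, ref = hyp_sentence.split(), ref_sentence.split()
    if not ref:
        return 0.0
    return max(0.0, 1.0 - word_edit_distance(hyp, ref) / len(ref))

# One word missing from the output -> 1 - 1/6 ≈ 0.833
print(simple_string_accuracy("the cat sat on mat", "the cat sat on the mat"))
```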
Log-linear Models
The BLEU scoring script is supplied by the NIST Open Machine Translation Evaluation at ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
NIST is mentioned in 3 sentences in this paper.