Abstract | Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make. |
Abstract | However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved. |
Introduction | With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model. |
Introduction | With the SVM reranker, we obtain a significant improvement in BLEU scores over the averaged perceptron model.
Introduction | Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases). |
Reranking with SVMs 4.1 Methods | In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments. |
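Reranking with SVMs 4.1 Methods | The pairwise reduction described above can be sketched as follows; this is an illustrative reimplementation under the usual ranking-SVM formulation, and the feature vectors and function names are assumptions, not taken from the paper.

```python
# A minimal sketch: BLEU scores over candidate realizations induce a
# preference order, which is converted into labeled difference vectors
# (the standard pairwise reduction for training a ranking SVM).
from itertools import combinations

def make_preference_pairs(candidates):
    """candidates: list of (feature_vector, bleu_score) for one input.
    Returns difference vectors labeled +1 if the first candidate is
    preferred (higher BLEU) and -1 otherwise."""
    X, y = [], []
    for (fa, ba), (fb, bb) in combinations(candidates, 2):
        if ba == bb:
            continue  # tied BLEU scores yield no usable preference
        diff = [a - b for a, b in zip(fa, fb)]
        X.append(diff)
        y.append(1 if ba > bb else -1)
    return X, y

# toy example: three candidates, two usable preference pairs
X, y = make_preference_pairs([([1.0, 0.2], 0.9),
                              ([0.5, 0.1], 0.7),
                              ([0.4, 0.9], 0.7)])
```

The resulting (X, y) pairs can then be fed to any linear SVM trainer; at test time, candidates are ranked by the learned weight vector's dot product with their feature vectors.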
Reranking with SVMs 4.1 Methods | The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02). |
Simple Reranking | Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations |
Simple Reranking | Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93. |
Evaluation | In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations. |
Evaluation | This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators. |
Evaluation | As expected, random selection yields poor performance, with a BLEU score of 30.52.
Abstract | We present a set of dependency-based pre-ordering rules which improved the BLEU score by 1.61 on the NIST 2006 evaluation data. |
Conclusion | The results showed that our approach achieved a BLEU score gain of 1.61. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set. |
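Dependency-based Pre-ordering Rule Set | The filtering step above can be sketched as a greedy loop; the `evaluate_bleu` callback is a hypothetical stand-in for retraining/decoding with a given rule set and scoring the development set, not an interface from the paper.

```python
# A hedged sketch of greedy rule filtering: a candidate pre-ordering rule
# is kept only if adding it improves dev-set BLEU over the current best.
def filter_rules(candidate_rules, evaluate_bleu):
    """candidate_rules: iterable of rule objects.
    evaluate_bleu: callable taking a rule list, returning dev-set BLEU."""
    kept = []
    baseline = evaluate_bleu(kept)
    for rule in candidate_rules:
        score = evaluate_bleu(kept + [rule])
        if score > baseline:      # rule helps: keep it and raise the bar
            kept.append(rule)
            baseline = score
    return kept

# toy example: rules "contribute" their value to a fake BLEU score
selected = filter_rules([1, -2, 3], lambda rules: 30 + sum(rules))
```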
Experiments | For evaluation, we used BLEU scores (Papineni et al., 2002). |
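Experiments | For readers unfamiliar with the metric, BLEU (Papineni et al., 2002) combines clipped n-gram precisions with a brevity penalty; the snippet below is a minimal, self-contained reimplementation for illustration, not the official scoring script (which also handles multiple references and corpus-level aggregation).

```python
# A minimal sketch of BLEU: modified (clipped) n-gram precision up to
# 4-grams, combined as a geometric mean, times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in hyp.items())  # clipped counts
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    # brevity penalty: punish hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the cat sat on the mat".split(),
             "the cat sat on a mat".split())
```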
Experiments | It shows the BLEU scores on the test set and the statistics of pre-ordering on the training set, which include the total count of each rule set and the number of sentences they were applied to.
Introduction | Experiment results showed that our pre-ordering rule set improved the BLEU score on the NIST 2006 evaluation data by 1.61. |
Our Approach | Figure 1: BLEU scores vs. k for SumBasic extraction.
Our Approach | As shown in Figure 1, our system’s BLEU scores increase rapidly until about k = 25. |
Conclusion and Future Work | We plan to give different weights to different training examples based on the drop in BLEU score the example can cause if classified incorrectly. |
MT System Selection | We run the 5,562 sentences of the classification training data through our four MT systems and produce sentence-level BLEU scores (with length penalty). |
MT System Selection | We pick the name of the MT system with the highest BLEU score as the class label for that sentence. |
MT System Selection | When there is a tie in BLEU scores, we pick the label of the system that yields the better overall BLEU score among the tied systems.
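MT System Selection | The labeling rule above can be sketched as follows; the system names and scores are illustrative assumptions, not data from the paper.

```python
# A hedged sketch of the class-labeling step: each sentence is labeled
# with the system achieving the highest sentence-level BLEU; ties are
# broken in favor of the tied system with the better overall corpus BLEU.
def pick_label(sentence_bleus, overall_bleus):
    """sentence_bleus / overall_bleus: dicts mapping system name -> BLEU."""
    best = max(sentence_bleus.values())
    tied = [s for s, b in sentence_bleus.items() if b == best]
    # tie-break by overall corpus-level BLEU among the tied systems
    return max(tied, key=lambda s: overall_bleus[s])

# sysA and sysB tie at the sentence level; sysB wins on overall BLEU
label = pick_label({"sysA": 0.42, "sysB": 0.42, "sysC": 0.31},
                   {"sysA": 0.275, "sysB": 0.281, "sysC": 0.290})
```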
Machine Translation Experiments | All differences in BLEU scores between the four systems are statistically significant above the 95% level. |
Machine Translation Experiments | We also report in Table 1 an oracle system selection where we pick, for each sentence, the English translation that yields the best BLEU score.
Conclusion | We observed that this often fails to return the best output in terms of BLEU score, fluency, grammaticality and/or meaning.
Results and Discussion | Figure 6: BLEU scores and Grammar Size (Number of Elementary TAG trees)
Results and Discussion | The average BLEU score is given with respect to all input (All) and to those inputs for which the systems generate at least one sentence (Covered). |
Results and Discussion | In terms of BLEU score, the best version of our system (AUTEXP) outperforms the probabilistic approach of IMS by a large margin (+0.17) and produces results similar to the fully handcrafted UDEL system.
Experiments | BLEU score: We used the Moses support tool multi-bleu to calculate BLEU scores.
Experiments | The BLEU scores in Table 4 show that our system produces simplifications that are closest to the reference.
Experiments | In sum, the automatic metrics indicate that our system produces simplifications that are consistently closest to the reference in terms of edit distance, number of splits, and BLEU score.
Related Work | (2010), namely an aligned corpus of 100/131 EWKP/SWKP sentences, and show that they achieve a better BLEU score.
Experiments | In Table 3, almost all BLEU scores are improved, no matter what strategy is used. |
Experiments | The final BLEU scores on NIST05 and NIST06 are given in Table 4. |
Experiments | BLEU scores on the large-scale training data. |
Experiments & Results | The BLEU scores, not included in the figure but shown in Table 2, show a similar trend.
Experiments & Results | Statistical significance on the BLEU scores was tested using pairwise bootstrap sampling (Koehn, 2004). |
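Experiments & Results | Paired bootstrap resampling (Koehn, 2004) can be sketched as below. This is a simplified illustration that averages per-sentence scores; Koehn's procedure recomputes corpus-level BLEU on each resample, so the per-sentence score lists here are an assumption for brevity.

```python
# A hedged sketch of paired bootstrap resampling: repeatedly resample the
# test set with replacement and count how often system A beats system B.
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """scores_a / scores_b: per-sentence metric scores on the same test set.
    Returns the fraction of resamples in which system A outscores B."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

A win fraction of, say, 0.95 or higher is then read as significance at the 95% level for A over B.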
Experiments & Results | Another discrepancy is found in the BLEU scores of the English→Chinese experiments, where we measure an unexpected drop in BLEU score below the baseline.
Complexity Analysis | It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings. |
Complexity Analysis | Table 4 presents the BLEU scores for Moses using different segmentation methods. |
Introduction | • improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.