Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. |
Abstract | However, more accurate, non-automatic, adequacy-oriented MT evaluation metrics such as HTER are highly labor-intensive, which bottlenecks the evaluation cycle. |
Abstract | We then replace the human semantic role annotators with automatic shallow semantic parsing to further automate the evaluation metric, and show that even the semi-automated evaluation metric achieves a 0.34 correlation coefficient with human adequacy judgment, which is still about 80% as closely correlated as HTER despite an even lower labor cost for the evaluation procedure. |
Abstract | A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. |
Introduction | However, a lack of standard datasets and automatic evaluation metrics has impeded progress in the field. |
Introduction | Second, we define a new evaluation metric, PINC (Paraphrase In N-gram Changes), that relies on simple BLEU-like n-gram comparisons to measure the degree of novelty of automatically generated paraphrases. |
Paraphrase Evaluation Metrics | A good paraphrase, according to our evaluation metric, has few n-gram overlaps with the source sentence but many n-gram overlaps with the reference sentences. |
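The novelty half of this idea can be sketched as a simple score: the fraction of candidate n-grams that do not appear in the source, averaged over n-gram orders. The following is a minimal illustrative sketch, assuming whitespace tokenization and averaging over n = 1..4; the function and variable names are our own, and the published PINC implementation may differ in detail.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(source, candidate, max_n=4):
    """Mean, over n = 1..max_n, of the fraction of candidate n-grams
    that do NOT occur in the source sentence (higher = more novel)."""
    src, cand = source.split(), candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:  # candidate shorter than n tokens
            continue
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, copying the source verbatim scores 0, a candidate sharing no n-grams with the source scores 1, and a full metric in this spirit would pair such a novelty score with a BLEU-style adequacy check against the reference sentences.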
Related Work | The more recently proposed metric PEM (Paraphrase Evaluation Metric) (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n-gram similarity between candidate and reference sentences. |