Introduction

Good performance on this task is the most desired property of evaluation metrics during system development.
Results and discussion

These input-level accuracies compare favorably with automatic evaluation metrics for other natural language processing tasks. For example, at the 2008 ACL Workshop on Statistical Machine Translation, all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).