Experimental Setup | Our automatic evaluation was based on Translation Edit Rate (TER, Snover et al., 2006).
Experimental Setup | TER is defined as the minimum number of edits a human would have to perform to change the system output so that it exactly matches a reference translation. |
Experimental Setup | $\mathrm{TER}(E, E^{r}) = \dfrac{\mathrm{Ins} + \mathrm{Del} + \mathrm{Sub} + \mathrm{Shft}}{N_{E^{r}}}$ (16)
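Experimental Setup | To make the edit counts in Eq. (16) concrete, below is a minimal Python sketch of a simplified TER that counts insertions, deletions, and substitutions with a word-level edit-distance recurrence and normalizes by the reference length; full TER also allows block shifts (Shft), which are omitted here, and the function name and whitespace tokenization are our own.

```python
def simplified_ter(hypothesis: str, reference: str) -> float:
    """Simplified TER: (Ins + Del + Sub) / reference length.

    Full TER also permits block shifts (Shft); this sketch omits them,
    so it upper-bounds the true TER score.
    """
    hyp = hypothesis.split()
    ref = reference.split()

    # Standard word-level Levenshtein edit distance.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i                          # delete i hypothesis words
    for j in range(len(ref) + 1):
        dist[0][j] = j                          # insert j reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub_cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution (or match)
            )
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)


print(simplified_ter("the cat sat on mat", "the cat sat on the mat"))  # 1/6 ≈ 0.167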
Results | Table 2 reports our results on the test set using TER . |
Results | The abstractive models obtain the best TER scores overall; however, they generate shorter captions than the other models (closer in length to the gold standard), and as a result TER treats them favorably simply because fewer edits are required.
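Results | As a purely hypothetical illustration of this length effect (the numbers are not from the paper): against a 10-word reference, a verbose 14-word caption that needs 4 deletions and 2 substitutions scores TER = 6/10 = 0.60, whereas a concise 10-word caption that needs only 3 substitutions scores TER = 3/10 = 0.30, even though neither count says much about which caption actually reads better.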
Background | [Figure axis: Diversity (TER [%])]
Background | The diversity is measured in terms of the Translation Edit Rate (TER) metric proposed by Snover et al. (2006).
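Background | A minimal sketch of one way such a diversity score could be computed, reusing the simplified_ter function from the earlier snippet and averaging pairwise TER over all ordered pairs of system outputs (the averaging scheme and function names are illustrative, not taken from the cited work):

```python
from itertools import permutations

def pairwise_diversity(outputs: list[str]) -> float:
    """Average pairwise TER between distinct system outputs, in percent.

    Higher values mean the outputs differ more from one another,
    i.e. the output set is more diverse.
    """
    pairs = list(permutations(outputs, 2))
    if not pairs:
        return 0.0
    # Treat one output as "hypothesis" and the other as "reference"
    # for each ordered pair, then average over all pairs.
    total = sum(simplified_ter(hyp, ref) for hyp, ref in pairs)
    return 100.0 * total / len(pairs)


outputs = [
    "a man rides a bike down the street",
    "a person is riding a bicycle on the road",
    "a man rides a bike down the street quickly",
]
print(f"Diversity (TER [%]): {pairwise_diversity(outputs):.1f}")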
Conclusion | In terms of TER, the improvements are 0.4 and 1.7 points.
Experimental Evaluation | [Table columns: BLEU | TER]
Experimental Evaluation | The metrics used for evaluation are the case-sensitive BLEU (Papineni et al., 2002) score and the translation edit rate (TER) (Snover et al., 2006) with one reference translation.
Experimental Evaluation | A second iteration of the training algorithm shows almost no change in BLEU score, but a small improvement in TER.
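Experimental Evaluation | As a minimal sketch (not the setup used in the cited work), a case-sensitive BLEU / TER evaluation against a single reference could be run with the sacrebleu toolkit, assuming its 2.x metrics API:

```python
# pip install sacrebleu
from sacrebleu.metrics import BLEU, TER

hypotheses = [
    "the committee approved the proposal yesterday",
    "he said the talks will continue next week",
]
# One reference translation per hypothesis (a single reference stream).
references = [[
    "the committee approved the proposal yesterday",
    "he said that talks would continue next week",
]]

bleu = BLEU()   # case-sensitive by default
ter = TER()
print(bleu.corpus_score(hypotheses, references))  # corpus-level BLEU
print(ter.corpus_score(hypotheses, references))   # corpus-level TER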
Conclusion |
  Functor   0.21  0.40   0.09
  Voidpar   0.16  0.53  -0.08
  PER       0.12  0.53  -0.09
  TER       0.07  0.53  -0.23
Extensions of SemPOS |
  NIST                0.69  0.90   0.53
  SemPOS              0.69  0.95   0.30
  2·SemPOS+1·BLEU4    0.68  0.91   0.09
  BLEU1               0.68  0.87   0.43
  BLEU2               0.68  0.90   0.26
  BLEU3               0.66  0.90   0.14
  BLEU                0.66  0.91   0.20
  TER                 0.63  0.87   0.29
  PER                 0.63  0.88   0.32
  BLEU4               0.61  0.90  -0.31
  Functorpar          0.57  0.83  -0.03
  Functor             0.55  0.82  -0.09
Extensions of SemPOS | The error metrics PER and TER showed the lowest correlation with human judgments for translation to Czech. |
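Extensions of SemPOS | A minimal sketch of how such metric-to-human correlations are typically obtained, here with Pearson's r from scipy over purely illustrative system-level scores (not the values in the tables above):

```python
from scipy.stats import pearsonr

# Illustrative system-level scores, one entry per MT system.
human_scores = [0.62, 0.55, 0.71, 0.48, 0.66]   # human judgments (higher = better)
ter_scores   = [0.41, 0.47, 0.35, 0.52, 0.39]   # TER (lower = better)

# TER is an error metric, so negate it before correlating with
# human quality judgments.
r, p_value = pearsonr(human_scores, [-s for s in ter_scores])
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")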