Abstract | On top of the pruning framework, we also propose a discriminative ITG alignment model using hierarchical phrase pairs, which improves both F-score and BLEU score over the baseline alignment system of GIZA++.
Evaluation | An alternative criterion is the upper bound on alignment F-score, which essentially measures how many links in the annotated alignment can be kept in an ITG parse.
Evaluation | The calculation of the F-score upper bound is done bottom-up, like ITG parsing.
Evaluation | The upper bound of the alignment F-score can thus be calculated as well.
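The preceding snippets describe computing the F-score upper bound bottom-up, in the manner of ITG parsing. The following is a hedged sketch of one way such a bound can be obtained, not the authors' implementation: a memoized bispan DP (written top-down for brevity) that finds the maximum number of gold links an ITG derivation can retain, assuming 1-1 gold links, and then converts that count into an F-score upper bound under perfect precision. The function name, the 1-1 restriction, and the brute-force split enumeration are illustrative assumptions.

```python
# Sketch: ITG upper bound on alignment F-score for 1-1 gold links.
from functools import lru_cache

def itg_fscore_upper_bound(src_len, tgt_len, gold_links):
    """gold_links: set of (i, j) word-index pairs, 0 <= i < src_len, 0 <= j < tgt_len."""
    gold = set(gold_links)

    @lru_cache(maxsize=None)
    def kept(i1, i2, j1, j2):
        # Terminal bispan: at most one word on each side (covers word-word,
        # word-null, and null-word cases).
        if i2 - i1 <= 1 and j2 - j1 <= 1:
            return 1 if (i2 - i1 == 1 and j2 - j1 == 1 and (i1, j1) in gold) else 0
        best = 0
        for s in range(i1, i2 + 1):
            for t in range(j1, j2 + 1):
                # Straight combination: [i1,s)x[j1,t) + [s,i2)x[t,j2);
                # skip splits where one child would equal the whole bispan.
                if not ((s == i1 and t == j1) or (s == i2 and t == j2)):
                    best = max(best, kept(i1, s, j1, t) + kept(s, i2, t, j2))
                # Inverted combination: [i1,s)x[t,j2) + [s,i2)x[j1,t).
                if not ((s == i1 and t == j2) or (s == i2 and t == j1)):
                    best = max(best, kept(i1, s, t, j2) + kept(s, i2, j1, t))
        return best

    k = kept(0, src_len, 0, tgt_len)
    # Perfect precision on the kept links; recall is bounded by k / |gold|.
    return 2.0 * k / (k + len(gold)) if gold else 1.0
```

For example, the classic "inside-out" permutation {(0,1), (1,3), (2,0), (3,2)} on a 4x4 sentence pair is not fully ITG-reachable; the DP keeps at most 3 of the 4 links, giving an upper bound of 2*3/(3+4), roughly 0.857.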
The DITG Models | The MERT module for DITG takes the alignment F-score of a sentence pair as the performance measure.
The DITG Models | Given an input sentence pair and the reference annotated alignment, MERT aims to maximize the F-score of the DITG-produced alignment.
Introduction | Using an adapted supertagger with ambiguity levels tuned to match the baseline system, we were also able to increase F-score on labelled grammatical relations by 0.75%. |
Results | Interestingly, while the decrease in supertag accuracy in the previous experiment did not translate into a decrease in F-score, the increase in tag accuracy here does translate into an increase in F-score.
Results | The increase in F-score has two sources. |
Results | As Table 6 shows, this change translates into an improvement of up to 0.75% in F-score on Section |
Experiments | We only compare the F-score, since all the compared systems have an attempted rate of 1.0,
Experiments | ), F-score (F1).
Experiments | System F-score |
Experimental Setup | [Figure: recall, precision, and F-score under ROUGE-1 and ROUGE-L]
Results | F-score is higher for the phrase-based system but not significantly. |
Results | The sentence ILP model outperforms the lead baseline with respect to recall but not precision or F-score . |
Results | The phrase ILP achieves a significantly better F-score than the lead baseline with both ROUGE-1 and ROUGE-L.
Evaluation | We calculated precision, recall, and f-score for our system, the baselines, and the upper bound as follows, with all_system being the number of pairs labelled as paraphrase or happens-before, all_gold the respective number of pairs in the gold standard, and correct the number of pairs labelled correctly by the system.
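For concreteness, the quantities named in that snippet combine in the standard way; the following is a restatement of the usual definitions, not the authors' code, with variable names mirroring the snippet's wording.

```python
# Standard precision/recall/F-score from the counts described above.
def prf(correct, all_system, all_gold):
    precision = correct / all_system if all_system else 0.0
    recall = correct / all_gold if all_gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```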
Evaluation | The f-score for the upper bound is in the column upper. |
Evaluation | For the f-score values, we calculated the significance of the difference between our system and the baselines, as well as the upper bound, using a resampling test (Edgington, 1986).
A Latent Variable CCG Parser | To determine statistical significance, we obtain p-values from Bikel’s randomized parsing evaluation comparator, modified for use with tagging accuracy, F-score and dependency accuracy.
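The resampling test cited above (Edgington, 1986) and Bikel's randomized comparator are both instances of randomized significance testing over per-sentence scores. Below is a hedged sketch of a generic paired approximate-randomization test, not the exact procedure of either tool; it averages per-sentence scores for simplicity, whereas corpus-level metrics such as F-score are usually recomputed from the swapped per-sentence counts instead.

```python
# Paired approximate-randomization significance test (generic sketch).
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Randomly swap the two systems' scores on each sentence.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(sum(shuffled_a) - sum(shuffled_b)) / len(scores_a)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing gives a slightly conservative p-value estimate.
    return (at_least_as_extreme + 1) / (trials + 1)
```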
A Latent Variable CCG Parser | In this section we evaluate the parsers using the traditional PARSEVAL measures which measure recall, precision and F-score on constituents in |
A Latent Variable CCG Parser | The Petrov parser has better results by a statistically significant margin for both labeled and unlabeled recall and unlabeled F-score.
Intrinsic evaluation | We report the alignment quality in terms of precision, recall and F-score.
Intrinsic evaluation | The F-score corresponding to perfect precision and the upper-bound recall is 94.75%. |
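As a quick arithmetic check of the figure just quoted (this follows purely from the F-score definition and is not an additional result from the paper): with perfect precision, the stated F-score determines the upper-bound recall,

$$F = \frac{2PR}{P + R} \;\overset{P=1}{=}\; \frac{2R}{1 + R} \quad\Longrightarrow\quad R = \frac{F}{2 - F} = \frac{0.9475}{1.0525} \approx 0.900.$$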
Intrinsic evaluation | Overall, the MM models obtain lower precision but higher recall and F-score than 1-1 models, which is to be expected as the gold standard is defined in terms of MM links. |