Abstract | We conduct experiments on new and existing gold-standard datasets to show the high quality and coverage of the resource.
Experiment 1: Mapping Evaluation | The gold-standard dataset includes 505 nonempty mappings, i.e. |
Experiment 2: Translation Evaluation | This is assessed in terms of coverage against gold-standard resources (Section 5.1) and against a manually-validated dataset of translations (Section 5.2). |
Experiment 2: Translation Evaluation | Table 2: Size of the gold-standard wordnets. |
Experiment 2: Translation Evaluation | We compare BabelNet against gold-standard resources for 5 languages, namely: the subset of GermaNet (Lemnitzer and Kunze, 2002) included in EuroWordNet for German, MultiWordNet (Pianta et al., 2002) for Italian, the Multilingual Central Repository for Spanish and Catalan (Atserias et al., 2004), and the Wordnet Libre du Français (Sagot and Fišer, 2008, WOLF) for French.
Experiment: Ranking Word Senses | To compare the predicted ranking to the gold-standard ranking, we use Spearman’s ρ, a standard method to compare ranked lists to each other.
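Spearman's ρ is the Pearson correlation of the two rank vectors, with tied items receiving their average rank. A minimal self-contained sketch (function names are ours, not from the paper):

```python
def average_ranks(values):
    """1-based ranks of `values`, averaging the ranks of tied items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole block of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1.0  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Identical rankings give ρ = 1, fully reversed rankings give ρ = -1.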
Experiment: Ranking Word Senses | The first column shows the correlation of our model’s predictions with the human judgments from the gold-standard, averaged over all instances.
Experiments: Ranking Paraphrases | We follow E&P and evaluate it only on the second subtask: we extract paraphrase candidates from the gold standard by pooling all annotated gold-standard paraphrases for all instances of a verb in all contexts, and use our model to rank these paraphrase candidates in specific contexts. |
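The pooling step described above can be sketched as follows; the data layout and names are illustrative assumptions of ours, not E&P's actual format:

```python
from collections import defaultdict

def pool_candidates(annotations):
    """Pool gold paraphrases per verb across all of its annotated contexts.

    annotations: iterable of (verb, context_id, gold_paraphrases) triples.
    Returns a dict mapping each verb to its pooled candidate set.
    """
    pools = defaultdict(set)
    for verb, _context_id, paraphrases in annotations:
        pools[verb].update(paraphrases)
    return dict(pools)
```

The pooled set for a verb is then re-ranked by the model separately for each specific context.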
Experiments: Ranking Paraphrases | P10 measures the percentage of gold-standard paraphrases in the top-ten list of paraphrases as ranked by the system, and can be defined as follows (McCarthy and Navigli, 2007): P10 = Σ_{s ∈ M∩G} f(s) / Σ_{s ∈ G} f(s),
Experiments: Ranking Paraphrases | where M is the list of 10 paraphrase candidates top-ranked by the model, G is the corresponding annotated gold-standard data, and f(s) is the weight of the individual paraphrases.
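Assuming P10 sums the gold weights f(s) of the paraphrases that appear in the model's top ten, normalized by the total gold weight, a minimal sketch (the dict-based gold representation is our assumption):

```python
def p10(model_ranking, gold_weights):
    """Weighted precision of the model's top-ten paraphrase list.

    model_ranking: candidates ranked by the model, best first.
    gold_weights: dict mapping each gold paraphrase s to its weight f(s).
    """
    M = model_ranking[:10]  # top-ten list M
    matched = sum(gold_weights.get(s, 0.0) for s in M)   # weight in M ∩ G
    total = sum(gold_weights.values())                   # weight in G
    return matched / total if total else 0.0
```

For example, with gold weights {"purchase": 2.0, "acquire": 1.0, "get": 1.0} and model ranking ["purchase", "obtain", "get"], P10 = (2.0 + 1.0) / 4.0 = 0.75.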
Conclusions and future work | First, we have created gold-standard implicit argument annotations for a small set of pervasive nominal predicates. Our analysis shows that these annotations add 65% to the role coverage of NomBank.
Evaluation | To factor out errors from standard SRL analyses, the model used gold-standard argument labels provided by PropBank and NomBank. |
Evaluation | We also evaluated an oracle model that made gold-standard predictions for candidates within the two-sentence prediction window. |
Implicit argument identification | Throughout our study, we used gold-standard discourse relations provided by the Penn Discourse TreeBank (Prasad et al., 2008). |