Comparative Evaluation | First, we randomly sampled 100 terms from our gold standard for each domain and each of the three languages. |
Comparative Evaluation | Table 9: Number of domain glosses (from a random sample of 100 gold standard terms per domain) retrieved using Google Define and GlossBoot. |
Comparative Evaluation | As for the precision of the extracted terms, we randomly sampled 50% of them for each system. |
Experimental Setup | To calculate precision we randomly sampled 5% of the retrieved terms and asked two human annotators to manually tag their domain pertinence (with adjudication in case of disagreement; κ = .62, indicating substantial agreement). |
Experimental Setup | Precision was determined on a random sample of 5% of the acquired glosses for each domain and language. |
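The annotation protocol mentioned above (two annotators tag each item, disagreements are adjudicated, and agreement is reported as κ) relies on Cohen's kappa. A minimal sketch of that computation follows; the example labels are illustrative, not the papers' actual annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values around .6-.8 are conventionally read as "substantial agreement", which matches the κ = .62 reported above.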
Discussion and Future Work | We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity, over a random sample of inference rule applications. |
Experimental Settings | Rule applications were generated by randomly sampling extractions from ReVerb, such as ('Jack', 'agree with', 'Jill') and then sampling possible rules for each, such as 'agree with → feel sorry for'. |
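The two-stage sampling described above (first sample extraction triples, then sample a candidate rule for each) can be sketched as below; `extractions` and `rules_for` are hypothetical container names for illustration, not ReVerb's actual interface.

```python
import random

def sample_rule_applications(extractions, rules_for, n, seed=0):
    """Sample n extraction triples, then one candidate inference rule for each.

    extractions: list of (arg1, relation, arg2) triples.
    rules_for:   dict mapping a relation to its candidate target relations.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    applications = []
    for arg1, rel, arg2 in rng.sample(extractions, n):
        candidates = rules_for.get(rel)
        if candidates:
            target = rng.choice(candidates)
            applications.append(((arg1, rel, arg2), f"{rel} -> {target}"))
    return applications
```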
Introduction | In order to promote replicability and equal-term comparison with our results, we based our experiments on publicly available datasets, both for unsupervised learning of the evaluated models and for testing them over a random sample of rule applications. |
Results | However, our result suggests that topic-level models might not be robust enough when applied to a random sample of inferences. |
Empirical Evaluation | To learn the Max-Ent parameters λ, we randomly sampled 500 terms from the held-out data (10 threads in our corpus which were excluded from the evaluation of the tasks in §6.2 and §6.3), each appearing at least 10 times, labeled them as topical (361) or AD-expressions (139), and used the corresponding features of each term (in the context of the posts where it occurs, §3) to train the Max-Ent model. |
Empirical Evaluation | Instead, we randomly sampled 500 pairs (≈ 34% of the population) for evaluation. |
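The Max-Ent training step above (binary labels, per-term feature vectors) amounts to fitting a logistic-regression model. The sketch below is a toy stochastic-gradient version under assumed numeric features; it is not the papers' feature set or training code.

```python
import math

def train_maxent(X, y, lr=0.1, epochs=500):
    """Binary Max-Ent (logistic regression) trained by stochastic gradient ascent.

    X: list of feature vectors; y: 0/1 labels (e.g. AD-expression vs topical).
    Returns the learned weights; the last entry is the bias term.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, features))
            p = 1.0 / (1.0 + math.exp(-z))  # model probability of label 1
            err = label - p                  # gradient of the log-likelihood
            for i, xi in enumerate(features):
                w[i] += lr * err * xi
            w[-1] += lr * err
    return w

def predict(w, features):
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, features))
    return 1 if z > 0 else 0
```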
Phrase Ranking based on Relevance | To compute coverage, we randomly sampled 500 documents from the corpus and listed the candidate n-grams in the collection of 500 sampled documents. |
Phrase Ranking based on Relevance | We then computed coverage to see how many of the relevant terms in the random sample were also present in the top-k phrases from the ranked candidate n-grams. |
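The coverage measure described above (the fraction of relevant terms from the sampled documents that appear among the top-k ranked candidates) can be sketched as follows; function and variable names are assumptions for illustration.

```python
def coverage_at_k(relevant_terms, ranked_candidates, k):
    """Fraction of relevant terms found among the top-k ranked candidate phrases."""
    if not relevant_terms:
        return 0.0
    top_k = set(ranked_candidates[:k])
    hits = sum(1 for term in relevant_terms if term in top_k)
    return hits / len(relevant_terms)
```

Note the denominator is the relevant set, so this is recall-like: it grows monotonically as k increases.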
Experiment 1: Oxford Lexical Predicates | We show in Table 3 the precision@k calculated over a random sample of 50 lexical predicates. As can be seen, while the quality of the classes is fairly high for low values of k, performance gradually degrades as we let k increase. |
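Precision@k, as used in the sentence above, is the fraction of the top-k ranked items that the annotators judged correct. A minimal sketch, with a hypothetical binary relevance set standing in for the human judgments:

```python
def precision_at_k(ranked_items, relevant, k):
    """Precision@k: fraction of the top-k ranked items judged relevant."""
    top_k = ranked_items[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)
```

Unlike the coverage measure, the denominator here is k itself, which is why precision can degrade as k grows: later ranks admit more non-pertinent items.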
Experiment 1: Oxford Lexical Predicates | Starting from the lexical predicate items obtained as described in Section 4.2, we selected those items belonging to a random sample of 20 usage notes among those provided by the Oxford dictionary, totaling 3,245 items. |
Experiment 1: Oxford Lexical Predicates | For tuning α we used a held-out set of 8 verbs, randomly sampled from the lexical predicates not used in the dataset. |
Conclusions | Moreover, our smoothing model, though unsupervised, provides reliable supervision when sufficiently many random samples of nearby words are available. |
Discussion | Generally speaking, statistical reliability increases as the number of random samples increases. |
Discussion | Therefore, we can conclude that if sufficiently many random samples of nearby words are provided, our smoothing model is reliable, even though it is trained in an unsupervised fashion. |