Related Work | Schein and Ungar observe that none of the 8 sampling methods investigated in their experiment achieved a significant improvement over the random sampling baseline on type b) errors. |
Related Work | In fact, entropy sampling and margin sampling even showed a decrease in performance compared to random sampling.
Related Work | In the first setting, we randomly select new instances from the pool (random sampling; rand).
Experimental setup | In order to speed up the experiments, a random sample of 2000 words was drawn from the pool and presented to the active learner each time. |
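The per-round pool subsampling described above can be sketched as follows; this is a minimal illustration, not the paper's implementation, and `score_fn`, `candidate_size`, and `batch_size` are hypothetical names chosen here:

```python
import random

def active_learning_round(pool, score_fn, candidate_size=2000, batch_size=10, rng=None):
    """One active-learning iteration: draw a random candidate subset from
    the unlabelled pool, score each candidate with the learner's query
    strategy, and return the top-scoring instances for annotation."""
    rng = rng or random.Random()
    candidates = rng.sample(pool, min(candidate_size, len(pool)))
    candidates.sort(key=score_fn, reverse=True)
    return candidates[:batch_size]
```

Restricting scoring to a 2000-word random subsample keeps each round cheap while leaving the query strategy itself unchanged.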
Results | The dashed curves in Figure 1 represent the baseline performance with no clustering, no context ordering, random sampling, and ALINE, unless otherwise noted.
Results | For instance, on the Spanish dataset, random sampling reached 97% word accuracy after 1420 words had been annotated, whereas QBB did so with only 510 words — a 64% reduction in labelling effort. |
Results | It is important to note that empirical comparisons of different active learning techniques have shown that random sampling establishes a very |
Comparative Evaluation | First, we randomly sampled 100 terms from our gold standard for each domain and each of the three languages. |
Comparative Evaluation | Table 9: Number of domain glosses (from a random sample of 100 gold standard terms per domain) retrieved using Google Define and GlossBoot. |
Comparative Evaluation | As for the precision of the extracted terms, we randomly sampled 50% of them for each system. |
Experimental Setup | To calculate precision we randomly sampled 5% of the retrieved terms and asked two human annotators to manually tag their domain pertinence (with adjudication in case of disagreement; κ = .62, indicating substantial agreement).
Experimental Setup | Precision was determined on a random sample of 5% of the acquired glosses for each domain and language. |
Training | To reduce computation, we employ NCE, which uses randomly sampled sentences from all target language sentences in Q as e′, and calculates the expected values by a beam search with beam width W to truncate alignments with low scores.
Training | where e+ is a target language sentence aligned to f+ in the training data, i.e., (f+, e+) ∈ T, e′ is a randomly sampled pseudo-target language sentence with length |e+|, and N denotes the number of pseudo-target language sentences per source sentence f+.
Training | In a simple implementation, each e′ is generated by repeating a random sampling from a set of target words (Ve) |e+| times and lining them up sequentially.
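The simple pseudo-target generation just described can be sketched as below; a minimal sketch only, with `sample_pseudo_targets` as a hypothetical name and uniform word sampling assumed:

```python
import random

def sample_pseudo_targets(target_vocab, length, n, rng=None):
    """Generate n pseudo-target sentences for NCE by drawing `length`
    words at random from the target vocabulary (Ve) and lining them up
    sequentially, as in the simple implementation described above."""
    rng = rng or random.Random()
    return [[rng.choice(target_vocab) for _ in range(length)]
            for _ in range(n)]
```

Each noise sentence matches the length |e+| of the aligned reference, so the N negatives differ from e+ only in their (random) word choices.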
ConceptResolver | The regularization parameter λ is automatically selected on each iteration by searching for a value which maximizes the log-likelihood of a validation set, which is constructed by randomly sampling 25% of L on each iteration.
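The per-iteration 25% validation split can be sketched as follows; this is an illustrative helper, not ConceptResolver's code, and `split_validation` is a hypothetical name:

```python
import random

def split_validation(labeled, frac=0.25, rng=None):
    """Hold out a random fraction of the labeled set L as a validation
    set, e.g. for selecting a regularization strength by validation
    log-likelihood. Returns (train, validation)."""
    rng = rng or random.Random()
    data = labeled[:]          # copy so the caller's list is untouched
    rng.shuffle(data)
    cut = int(len(data) * frac)
    return data[cut:], data[:cut]
```

Re-drawing the split on every iteration avoids overfitting λ to one fixed held-out subset.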
Evaluation | Resolver precision can be interpreted as the probability that a randomly sampled sense (in a cluster with at least 2 senses) is in a cluster representing its true meaning. |
Evaluation | To create this set, we randomly sampled noun phrases from each category and manually matched each noun phrase to one or more real-world entities. |
Evaluation | To make this difference concrete, Figure 2 (first page) shows a random sample of 10 concepts from both company and athlete. |
Introduction | Figure 2: A random sample of concepts created by ConceptResolver. |
Evaluation | The collection of queries is a random sample of fully-anonymized queries in English submitted by Web users in 2006. |
Evaluation | To test whether that is the case, a random sample of 200 class labels, out of the 2,614 labels found to be potentially-useful specific concepts, are manually annotated as correct, subjectively correct or incorrect, as shown in Table 2. |
Evaluation | Rather than inspecting a random sample of classes, the evaluation validates the results against a reference set of 40 gold-standard classes that were manually assembled as part of previous work (Pasca, 2007). |
Conclusions and Future Work | This may be an effect of ‘sparseness’ of relevant user data, in that users talk about politics very sporadically compared to a random sample of their neighbors. |
Identifying Twitter Social Graph | In the Fall of 2012, leading up to the elections, we randomly sampled n = 516 Democratic and m = 515 Republican users. |
Identifying Twitter Social Graph | For each such user we collect recent tweets and randomly sample their immediate k = 10 neighbors from follower, friend, user mention, reply, retweet and hashtag social circles. |
Identifying Twitter Social Graph | Similar to the candidate-centric graph, for each user we collect recent tweets and randomly sample user social circles in the Fall of 2012. |
Experiments | For the test data, we randomly sampled 23,650 examples of (event causality candidate, original sentence) among which 3,645 were positive from 2,451,254 event causality candidates extracted from our web corpus (Section 3.1). |
Experiments | Note that, for the diversity of the sampled scenarios, our sampling proceeded as follows: (i) Randomly sample a beginning event phrase from the generated scenarios. |
Experiments | (ii) Randomly sample an effect phrase for the beginning event phrase from the scenarios. |
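Steps (i) and (ii) above can be sketched as a two-stage draw; a minimal illustration assuming scenarios are grouped by beginning event phrase, with `sample_scenario` as a hypothetical name:

```python
import random

def sample_scenario(scenarios, rng=None):
    """Two-step sampling for diversity: (i) pick a beginning event
    phrase at random, then (ii) pick an effect phrase among the
    scenarios that start with it. `scenarios` maps each beginning
    phrase to its list of effect phrases."""
    rng = rng or random.Random()
    beginning = rng.choice(sorted(scenarios))     # step (i)
    effect = rng.choice(scenarios[beginning])     # step (ii)
    return beginning, effect
```

Sampling the beginning phrase first gives rare beginnings the same chance as frequent ones, which is what makes the drawn scenarios diverse.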
Empirical Evaluation | the Max-Ent parameters λ, we randomly sampled 500 terms from the held-out data (10 threads in our corpus which were excluded from the evaluation of tasks in §6.2, §6.3) appearing at least 10 times and labeled them as topical (361) or AD-expressions (139) and used the corresponding features of each term (in the context of posts where it occurs, §3) to train the Max-Ent model.
Empirical Evaluation | Instead, we randomly sampled 500 pairs (≈ 34% of the population) for evaluation.
Phrase Ranking based on Relevance | To compute coverage, we randomly sampled 500 documents from the corpus and listed the candidate n-grams3 in the collection of sampled 500 documents. |
Phrase Ranking based on Relevance | We then computed the coverage to see how many of the relevant terms in the random sample were also present in top k phrases from the ranked candidate n-grams. |
Discussion and Future Work | We therefore focused on comparing the performance of our two-level scheme with state-of-the-art prior topic-level and word-level models of distributional similarity, over a random sample of inference rule applications. |
Experimental Settings | Rule applications were generated by randomly sampling extractions from ReVerb, such as (‘Jack’, ‘agree with’, ‘Jill’) and then sampling possible rules for each, such as ‘agree with → feel sorry for’.
Introduction | In order to promote replicability and equal-term comparison with our results, we based our experiments on publicly available datasets, both for unsupervised learning of the evaluated models and for testing them over a random sample of rule applications. |
Results | However, our result suggests that topic-level models might not be robust enough when applied to a random sample of inferences. |
Experimental Results | Table 3 lists query-product associations for five randomly sampled products along with their model scores from Pmle and Pintp.
Experimental Results | We created two samples from the TEST dataset: one randomly sampled by taking click weights into account, and the other sampled uniformly at random. |
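The two evaluation samples just described can be sketched as follows; this is a generic illustration, with `sample_test_sets` as a hypothetical name and per-example click counts assumed as the weights:

```python
import random

def sample_test_sets(examples, clicks, k, rng=None):
    """Create two evaluation samples: one drawn with probability
    proportional to click weight (with replacement), and one drawn
    uniformly at random (without replacement)."""
    rng = rng or random.Random()
    weighted = rng.choices(examples, weights=clicks, k=k)  # click-weighted
    uniform = rng.sample(examples, k)                      # uniform
    return weighted, uniform
```

The click-weighted sample reflects what users actually see, while the uniform sample exposes behaviour on the long tail of rarely clicked items.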
Experimental Results | Table 3: Example query-product association scores for a random sample of five products. |
Experiments | For each of the 500 observed tuples in the test-set we generated a pseudo-negative tuple by randomly sampling two noun phrases from the distribution of NPs in both corpora. |
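The pseudo-negative construction above can be sketched as below; a minimal illustration, with `pseudo_negative` as a hypothetical name and the NP distribution represented as a frequency table:

```python
import random

def pseudo_negative(relation, np_counts, rng=None):
    """Build a pseudo-negative tuple by drawing two noun phrases from
    the empirical NP distribution (frequency-weighted), keeping the
    relation of the observed tuple."""
    rng = rng or random.Random()
    nps = list(np_counts)
    weights = [np_counts[np] for np in nps]
    arg1, arg2 = rng.choices(nps, weights=weights, k=2)
    return (arg1, relation, arg2)
```

Drawing arguments from the corpus NP distribution (rather than uniformly) makes the negatives frequency-matched, so a classifier cannot separate them from observed tuples by argument frequency alone.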
Experiments | (2007), we randomly sampled 100 inference rules.
Experiments | We randomly sampled 300 of these inferences to hand-label. |
Empirical Evaluation | Ideally, we would like to evaluate a random sample of the more than 1,000 languages represented in PANDICTIONARY.5 However, a high-quality evaluation of translation between two languages requires a person who is fluent in both languages. |
Empirical Evaluation | We provided our evaluators with a random sample of translations into their native language. |
Empirical Evaluation | To carry out this comparison, we randomly sampled 1,000 senses from English Wiktionary and ran the three algorithms over them. |
Introduction | In a random sample of recipe reviews from allrecipes.com, we found that 57.8% contain refinements of the original recipe.
Introduction | We created a new recipe data set, and manually labeled a random sample to evaluate our model and several baselines. |
Models | In a manually labeled random sample of recipe reviews, we find that refinement segments tend to be clustered together in certain reviews (“bursty”), rather than uniformly distributed across all reviews. |
Experiments | We use the Twigg SDK 7 to crawl all tweets from April 20th 2010 to April 25th 2010, then drop non-English tweets, leaving about 11,371,389 tweets. From these, 15,800 tweets are randomly sampled and labeled by two independent annotators, so that the beginning and the end of each named entity are marked with <TYPE> and </TYPE>, respectively.
Task Definition | This is based on an investigation of 12,245 randomly sampled tweets, which are manually labeled. |
Task Definition | According to our investigation on 12,245 randomly sampled tweets that are manually labeled, about 46.8% have at least one named entity. |
Experiment 1: Oxford Lexical Predicates | We show in Table 3 the precision@k calculated over a random sample of 50 lexical predicates.11 As can be seen, while the class quality is quite high with low values of k, performance gradually degrades as we let k increase.
Experiment 1: Oxford Lexical Predicates | Starting from the lexical predicate items obtained as described in Section 4.2, we selected those items belonging to a random sample of 20 usage notes among those provided by the Oxford dictionary, totaling 3,245 items. |
Experiment 1: Oxford Lexical Predicates | For tuning α we used a held-out set of 8 verbs, randomly sampled from the lexical predicates not used in the dataset.
Conclusion and Future Work | The experimental results show that the verb and verb phrase classification method is reasonably accurate, achieving 91% precision and 78% recall on a manually constructed gold standard consisting of 80 verbs, and 82% accuracy on a random sample of all the WordNet entries.
Experience Detection | We randomly sampled 1,000 sentences4 and asked three annotators to judge whether or not individual sentences are considered containing an experience based on our definition.
Lexicon Construction | We randomly sampled 200 items and examined how accurately the classification was done. |
Conclusions | Moreover, our smoothing model, though unsupervised, provides reliable supervision when sufficiently random samples of words are available as nearby words. |
Discussion | Generally speaking, statistical reliability increases as the number of random samples increases.
Discussion | Therefore, we can conclude that if sufficiently random samples of nearby words are provided, our smoothing model is reliable, though it is trained in an unsupervised fashion. |
Phase 1: Inducing the Page Taxonomy | Taxonomy quality To evaluate the quality of our page taxonomy we randomly sampled 1,000 Wikipedia pages. |
Phase 1: Inducing the Page Taxonomy | It was established by selecting the combination, among all possible permutations, which maximized precision on a tuning set of 100 randomly sampled pages, disjoint from our page dataset. |
Phase 3: Category taxonomy refinement | Category taxonomy quality To estimate the quality of the category taxonomy, we randomly sampled 1,000 categories and, for each of them, we manually associated the super-categories which were deemed to be appropriate hypemyms. |
Seed diversity | We randomly sample Sgold from two sets of correct terms extracted from the evaluation cache. |
Unsupervised bagging | One approach is to use uniform random sampling from restricted sections of Lhand. |
Unsupervised bagging | We performed random sampling from the top 100, 200 and 500 terms of Lhand. |
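The restricted-section sampling above can be sketched as follows; a minimal illustration, with `restricted_sample` as a hypothetical name and Lhand represented as a ranked list of terms:

```python
import random

def restricted_sample(ranked_terms, top_n, k, rng=None):
    """Uniform random sampling restricted to the top-n section of a
    ranked lexicon (standing in for Lhand), as in the unsupervised
    bagging variant: truncate to the top section, then sample from it."""
    rng = rng or random.Random()
    section = ranked_terms[:top_n]
    return rng.sample(section, min(k, len(section)))
```

Varying `top_n` (e.g. 100, 200, 500) trades off the precision of the sampled seeds against their diversity.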
Conclusions | Through manual evaluation we found that the algorithm could correctly identify 60.4% birth cases from a set of 48 random samples and 57% split/join cases from a set of 21 randomly picked samples. |
Evaluation framework | We selected 48 random samples of candidate words for birth cases and 21 random samples for split/join cases. |
Evaluation framework | A further analysis of the words marked due to birth in the random samples indicates that there are 22 technology-related words, 2 slangs, 3 economics related words and 2 general words. |
Data preparation | The parallel corpus is randomly sampled into two large and equally-sized parts. |
Data preparation | The final test set is created by randomly sampling the desired number of test instances. |
Experiments & Results | The final test sets are 5,000 randomly sampled sentence pairs from the 200,000-sentence test split for each language pair.
Experimental Results | We estimate the quality of paraphrases by annotating a random sample as correct/incorrect and calculating the accuracy. |
Experimental Results | We estimate the precision (P) of the extracted instances by annotating a random sample of instances as correct/incorrect. |
Experimental Results | We randomly sampled 50 instances of the “acquisition” and “birthplace” relations from the system and the baseline outputs. |