Experiments | Note that most previous work does not report (or need) a standard development set; hence, for tuning our features and their hyper-parameters, we randomly split the original training data into a training and development set with a 70/30 ratio (and then use the full original training set during testing). |
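A minimal sketch of such a 70/30 tuning split (function and variable names are illustrative, not from the original paper):

```python
import random

def split_train_dev(examples, dev_ratio=0.3, seed=0):
    """Randomly split the original training data into a 70/30
    train/dev partition for feature and hyper-parameter tuning."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * (1.0 - dev_ratio))
    return examples[:cut], examples[cut:]  # 70% train, 30% dev
```

At test time the split is discarded and the model is retrained on the full original training set, as the sentence above describes.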
Experiments | Note that the development set is used only for ACE04; for ACE05 and ACE05-ALL, we directly test using the features tuned on ACE04. |
Experiments | Table 2: Incremental results for the Web features on the ACE04 development set. |
Semantics via Web Features | As a real example from our development set, the co-occurrence count c12 for the headword pair (leader, president) is 11383, while it is only 95 for the headword pair (voter, president); after normalization and log10, the values are -10.9 and -12.0, respectively. |
Semantics via Web Features | Also, we do not constrain the order of h1 and h2, because these patterns can hold for either direction of coreference. As a real example from our development set, the H12 count for the headword pair (leader, president) is 752, while for (voter, president) it is 0. |
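A hedged sketch of turning such raw web counts into a feature value; dividing the pair count by the product of the individual headword counts is an assumed normalization, not necessarily the paper's exact choice:

```python
import math

def cooccurrence_feature(c12, c1, c2):
    """Normalized log10 co-occurrence of a headword pair (h1, h2).
    c12 is the web count of the pair appearing together; c1 and c2
    are the individual headword counts (c1 * c2 normalization is an
    assumption here)."""
    if c12 == 0:
        return None  # pairs never seen together get a separate treatment
    return math.log10(c12 / (c1 * c2))
```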
Semantics via Web Features | We chose the following three context types, based on performance on a development set: |
Experimental Results 5.1 Data Resources | This book contains eleven practice tests, and we used all the sentence completion questions in the first five tests as a development set, and all the questions in the last six tests as the test set. |
Experimental Results 5.1 Data Resources | To provide human benchmark performance, we asked six native-speaking high school students and five graduate students to answer the questions on the development set. |
Experimental Results 5.1 Data Resources | For the LSA-LM, an interpolation weight of 0.1 was used for the LSA score, determined through optimization on the development set. |
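The interpolation itself is just a convex combination; a sketch with the quoted weight of 0.1 on the LSA score, assuming the two scores are on comparable scales:

```python
def lsa_lm_score(lsa_score, lm_score, lam=0.1):
    """Interpolate the LSA similarity score with the language-model
    score; lam = 0.1 is the dev-set-tuned weight quoted above."""
    return lam * lsa_score + (1.0 - lam) * lm_score
```

The completion candidate with the highest combined score would then be selected as the answer.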
Sentence Completion via Latent Semantic Analysis | In practice, a value of 1.2 was selected on the basis of development set results. |
Datasets | In other words, the development set includes documents from July 1995, July 1996, July 1997, etc. |
Datasets | The development set includes 7,300 documents from July of each year. |
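A small sketch of that temporal split, assuming each document carries a date (the (doc, date) representation is an assumption):

```python
from datetime import date

def split_by_month(corpus, dev_month=7):
    """Send documents from July of each year to the development set;
    everything else stays in training."""
    dev = [doc for doc, d in corpus if d.month == dev_month]
    train = [doc for doc, d in corpus if d.month != dev_month]
    return train, dev

# Example: July 1995 goes to dev, August 1995 stays in train.
train, dev = split_by_month([("doc1", date(1995, 7, 3)),
                             ("doc2", date(1995, 8, 14))])
```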
Experiments and Results | The λ factor in the joint classifier is optimized on the development set as described in Section 4.3. |
Experiments and Results | The features described in this paper were selected solely by studying performance on the development set. |
Learning Time Constraints | Figure 4: Development set accuracy and λ values. |
Experiments | The regularization parameter λ is tuned on the development set. |
Experiments | We run all three algorithms for multiple epochs and pick the best epoch based on development set performance. |
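A hedged sketch of this tuning recipe (grid-search the regularization parameter, keep the epoch with the best dev score); the train_one_epoch and evaluate callables are hypothetical stand-ins, not the paper's API:

```python
def tune_on_dev(train_one_epoch, evaluate, lambdas=(0.01, 0.1, 1.0),
                max_epochs=20):
    """Return the (lambda, epoch) pair with the best dev-set score."""
    best = {"lam": None, "epoch": None, "score": float("-inf")}
    for lam in lambdas:
        state = None  # model state threaded through the epochs
        for epoch in range(1, max_epochs + 1):
            state = train_one_epoch(state, lam)
            score = evaluate(state)
            if score > best["score"]:
                best = {"lam": lam, "epoch": epoch, "score": score}
    return best
```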
Experiments | For the first set of experiments, we use the same division of the corpus as in (Livescu and Glass, 2004; Jyothi et al., 2011) into a 2,492-word training set, a 165-word development set, and a 236-word test set. |
Abstract | Following most previous work, e.g., (Collins, 2002) and (Shen et al., 2007), we divide this corpus into a training set (sections 0-18), development set (sections 19-21) and the final test set (sections 22-24). |
Abstract | Following (Jiang et al., 2008a), we divide this corpus into a training set (chapters 1-260), development set (chapters 271-300) and the final test set (chapters 301-325). |
Abstract | Experiments in this section are carried out on the development set. |
Experiments | Figure 4 shows the UAS curves on the development set, where K is the beam size for Intersect and the K-best list size for Rescoring; the x-axis represents K, and the y-axis represents the UAS scores. |
Experiments | Table 3: Parsing times on the development set (seconds for all sentences) |
Experiments | Table 3 shows the parsing times of Intersect on the development set for English. |
Implementation Details | The numbers, 10% and 30%, are tuned on the development sets in the experiments. |
Abstract | We present experiments on learning from 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets. |
Experiments | The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets: BLEU result differences between tuning 12 default features |
Experiments | However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set. |
Introduction | Our resulting models are learned on large data sets, but they are small and outperform models that tune feature sets of various sizes on small development sets. |
Constituent Recombination | The parameters λ and p are tuned by Powell's method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective. |
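Powell's method is derivative-free, which suits a non-smooth objective like PARSEVAL F1; a sketch using SciPy's implementation, where parseval_f1 is a hypothetical scorer over the development set:

```python
import numpy as np
from scipy.optimize import minimize

def tune_with_powell(parseval_f1, x0=(1.0, 0.5)):
    """Maximize dev-set F1 over the parameters by minimizing -F1."""
    result = minimize(lambda x: -parseval_f1(x),
                      np.asarray(x0, dtype=float), method="Powell")
    return result.x, -result.fun  # tuned parameters and their F1
```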
Experiment | For parser combination, we follow the setting of Fossum and Knight (2009), using Section 24 instead of Section 22 of the WSJ treebank as the development set. |
Experiment | It is tuned on a development set using the gold sec- |
Experiments | Table 1: Intrinsic evaluation accuracy [%] (development set) for Arabic segmentation and tagging. |
Experiments | Table 1 shows development set accuracy for two settings. |
Experiments | We tuned the feature weights on a development set using lattice-based minimum error rate training (MERT) (Macherey et al., 2008). |
Experiment | We estimated the optimal values of the stopping probabilities s using the development set. |
Experiment | In all our experiments, we conducted ten independent runs to train our model, and selected the one that performed best on the development set in terms of parsing accuracy. |
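A sketch of that restart strategy, with train_model and parsing_accuracy as hypothetical stand-ins for the paper's trainer and evaluator:

```python
def best_of_n_runs(train_model, parsing_accuracy, n_runs=10):
    """Train n_runs independently seeded models and keep the one with
    the best parsing accuracy on the development set."""
    best_model, best_acc = None, float("-inf")
    for seed in range(n_runs):
        model = train_model(seed=seed)
        acc = parsing_accuracy(model)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc
```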
Experiment | development set (S 100). |