Related Work | The training and development sets were released in full to task participants. |
Related Work | However, we were unable to download all the training and development sets because some tweets were deleted or not available due to modified authorization status. |
Related Work | The tradeoff parameter of ReEmb (Labutov and Lipson, 2013) is tuned on the development set of SemEval 2013. |
Experimental Setup | For efficiency, we limit the sentence length to 70 tokens in training and development sets. |
Experimental Setup | This gives a 99% pruning recall on the CATiB development set. |
Experimental Setup | After pruning, we tune the regularization parameter λ ∈ {0.1, 0.01, 0.001} on development sets for different languages. |
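A tuning step of this kind can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the toy scoring function are assumptions, and only the candidate grid {0.1, 0.01, 0.001} comes from the sentence above.

```python
# Hypothetical sketch: pick the regularization strength that maximizes
# accuracy on a held-out development set.

def tune_regularization(train_fn, eval_fn, dev_set, grid=(0.1, 0.01, 0.001)):
    """train_fn(reg) returns a model; eval_fn(model, dev_set) returns accuracy."""
    best_reg, best_acc = None, float("-inf")
    for reg in grid:
        model = train_fn(reg)
        acc = eval_fn(model, dev_set)
        if acc > best_acc:
            best_reg, best_acc = reg, acc
    return best_reg, best_acc

# Toy usage: a "model" that is just its regularization value, and a dev
# "accuracy" that happens to peak at 0.01.
best, acc = tune_regularization(
    train_fn=lambda reg: reg,
    eval_fn=lambda model, dev: 1.0 - abs(model - 0.01),
    dev_set=None,
)
```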
Sampling-Based Dependency Parsing with Global Features | 2In our work we choose α = 0.003, which gives a 98.9% oracle POS tagging accuracy on the CATiB development set. |
Experiments | For this experiment, all models were estimated from the training set and evaluated on the development set. |
Experiments | For test set evaluations, we trained on the combination of the training and development sets (§2), to maximize the amount of training data for the final experiments. |
Experiments | 12We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set. |
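The threshold search just described can be sketched as below. Only the grid (1001 evenly spaced points in [1, 4]) and the selection criterion come from the sentence; the function and the toy data are illustrative assumptions.

```python
# Hypothetical sketch: sweep 1001 evenly spaced thresholds in [1, 4] and
# keep the one whose binarized predictions best match the dev labels.

def select_threshold(scores, gold_binary, lo=1.0, hi=4.0, n_points=1001):
    """scores: model predictions on the dev set; gold_binary: 0/1 labels."""
    step = (hi - lo) / (n_points - 1)
    best_t, best_acc = lo, -1.0
    for i in range(n_points):
        t = lo + i * step
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == g for p, g in zip(preds, gold_binary)) / len(gold_binary)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy dev set: any threshold in (2.5, 3.1] separates the labels perfectly.
t, acc = select_threshold([1.2, 2.5, 3.1, 3.9], [0, 0, 1, 1])
```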
System Description | On 10 preliminary runs with the development set, this variance |
System Description | Table 1: Pearson’s r on the development set, for our full system and variations excluding each feature type. |
Experiment | We conducted experiments on the Penn Chinese Treebank (CTB) version 5.1 (Xue et al., 2005): Articles 001-270 and 400-1151 were used as the training set, Articles 301-325 were used as the development set, and Articles 271-300 were used |
Experiment | We tuned the optimal number of iterations of the perceptron training algorithm on the development set. |
Experiment | We trained these three systems on the training set and evaluated them on the development set . |
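Selecting the iteration count on a development set, as described above, amounts to early stopping: evaluate after each epoch and keep the epoch with the best dev accuracy. The sketch below assumes the per-epoch accuracies are already computed; the function name and the numbers are illustrative.

```python
# Hypothetical sketch: choose the perceptron iteration count by tracking
# development-set accuracy after each training epoch.

def best_iteration(dev_acc_per_epoch):
    """Return the (1-based) epoch with the highest development accuracy."""
    best_epoch = max(range(len(dev_acc_per_epoch)),
                     key=lambda i: dev_acc_per_epoch[i])
    return best_epoch + 1

# E.g. accuracy rises, peaks at epoch 3, then degrades (overfitting):
chosen = best_iteration([0.80, 0.84, 0.86, 0.85, 0.83])
```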
Empirical evaluation | We tuned the L1 regularization strength, developed features, and ran analysis experiments on the development set (averaging across random splits). |
Empirical evaluation | To further examine this, we ran BCFL13 on the development set, allowing it to use only predicates from logical forms suggested by our logical form construction step. |
Empirical evaluation | This improved oracle accuracy on the development set to 64.5%, but accuracy was 32.2%. |
Results | the English development set as a function of number of training iterations with two different beam sizes, 20 and 100, over the local and nonlocal feature sets. |
Results | In Figure 4 we compare early update with LaSO and delayed LaSO on the English development set. |
Results | Table 1 displays the differences in F-measures and CoNLL average between the local and nonlocal systems when applied to the development sets for each language. |
Dependency-based Pre-ordering Rule Set | 4. Conduct primary experiments using the same training set and development set as the experiments described in Section 3. |
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered out those that did not work, based on the BLEU scores on the development set. |
Experiments | Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentence pairs. |
Experiments | The evaluation set contained 200 sentences randomly selected from the development set . |
Error Analysis | We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set; see Table 4. |
Error Analysis | Table 4: Counts of error categories (out of 100 randomly sampled ARZ development set errors). |
Error Analysis | One example of this distinction that appeared in the development set is the pair of spellings for mawḍūʿī “my topic”. |
Experiments | F1 scores provide a more informative assessment of performance than word-level or character-level accuracy scores, as over 80% of tokens in the development sets consist of only one segment, with an average of one segmentation every 4.7 tokens (or one every 20.4 characters). |
Experiments | Table 1 contains results on the development set for the model of Green and DeNero and our improvements. |
Regularization Improves Topic Models | We split each dataset into a training fold (70%), a development fold (15%), and a test fold (15%): the training data are used to fit models; the development set is used to select parameters (anchor threshold M, document prior α, regularization weight λ); and final results are reported on the test fold. |
Regularization Improves Topic Models | We select α using grid search on the development set. |
Regularization Improves Topic Models | 4.1 Grid Search for Parameters on Development Set |
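A multi-parameter grid search of the kind described above can be sketched with a Cartesian product over the candidate values. The parameter names M, α, and λ come from the sentences above; the grids and the toy scoring function are assumptions for illustration.

```python
# Hypothetical sketch: exhaustive grid search over three hyperparameters,
# keeping the combination with the best development-fold score.
from itertools import product

def grid_search(score_fn, M_grid, alpha_grid, lam_grid):
    """score_fn(M, alpha, lam) returns a dev-fold score (higher is better)."""
    return max(product(M_grid, alpha_grid, lam_grid),
               key=lambda params: score_fn(*params))

# Toy score peaking at M=100, alpha=0.1, lam=0.0:
best = grid_search(
    score_fn=lambda M, a, lam: -(M - 100) ** 2 - (a - 0.1) ** 2 - lam,
    M_grid=[50, 100, 200],
    alpha_grid=[0.01, 0.1, 1.0],
    lam_grid=[0.0, 0.5, 1.0],
)
```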
Annotations | Table 2: Results for the Penn Treebank development set, sentences of length ≤ 40, for different annotation schemes implemented on top of the X-bar grammar. |
Features | Table 1 shows the results of incrementally building up our feature set on the Penn Treebank development set . |
Other Languages | (2013) only report results on the development set for the Berkeley-Rep model; however, the task organizers also use a version of the Berkeley parser provided with parts of speech from high-quality POS taggers for each language (Berkeley-Tags). |
Other Languages | On the development set, we outperform the Berkeley parser and match the performance of the Berkeley-Rep parser. |
Experiments | While emails and weblogs are used as the development sets, reviews, news groups and Yahoo! Answers are used as the final test sets. |
Experiments | All these parameters are selected according to the averaged accuracy on the development set . |
Experiments | Experimental results under the 4 combined settings on the development sets are illustrated in Figures 2, 3 and 4, where the |
Citation Extraction Data | There are 660 citations in the development set and 367 citations in the test set. |
Citation Extraction Data | We then use the development set to learn the penalties for the soft constraints, using the perceptron algorithm described in section 3.1. |
Citation Extraction Data | We instantiate constraints from each template in section 5.1, iterating over all possible labels that contain a B prefix at any level in the hierarchy and pruning all constraints with imp(c) < 2.75 calculated on the development set. |
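The pruning step just described can be sketched as a simple filter. Only the cutoff 2.75 and the keep/prune criterion come from the sentence; `imp` is a stand-in for the importance score computed on the development set, and the toy scores are assumptions.

```python
# Hypothetical sketch: prune candidate constraints whose development-set
# importance score falls below the cutoff (2.75 above).

def prune_constraints(candidates, imp, cutoff=2.75):
    """Keep only constraints c with imp(c) >= cutoff."""
    return [c for c in candidates if imp(c) >= cutoff]

# Toy importance scores for three candidate constraints:
imp_scores = {"c1": 5.0, "c2": 2.75, "c3": 1.2}
kept = prune_constraints(["c1", "c2", "c3"], imp_scores.get)
```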
Soft Constraints in Dual Decomposition | We found it beneficial, though it is not theoretically necessary, to learn the constraints on a held-out development set, separately from the other model parameters, as during training most constraints are satisfied due to overfitting, which leads to an underestimation of the relevant penalties. |
Experimental Setup | For parameter estimation, we tune all parameters (utterance selection and path ranking) exhaustively with 0.1 intervals using our development set. |
Phrasal Query Abstraction Framework | The parameters α and β are tuned on a development set and sum to 1. |
Phrasal Query Abstraction Framework | We estimate the percentage of the retrieved utterances based on the development set . |
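Because the two weights sum to 1, tuning them reduces to a one-dimensional sweep over α with β = 1 − α (at 0.1 intervals, matching the granularity mentioned earlier in this section). The sketch below is illustrative; the function name and the toy scoring function are assumptions.

```python
# Hypothetical sketch: tune two interpolation weights that sum to 1 by
# sweeping alpha over [0, 1] in steps of 1/steps, with beta = 1 - alpha.

def tune_interpolation(score_fn, steps=10):
    """score_fn(alpha, beta) returns a dev-set score (higher is better)."""
    candidates = [i / steps for i in range(steps + 1)]
    best_alpha = max(candidates, key=lambda a: score_fn(a, 1.0 - a))
    return best_alpha, 1.0 - best_alpha

# Toy score peaking at alpha = 0.7:
alpha, beta = tune_interpolation(lambda a, b: -(a - 0.7) ** 2)
```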
Experiments | The beam size was tuned on the development set, and a value of 128 was found to achieve a reasonable balance of accuracy and speed; hence this value was used for all experiments. |
Experiments | dependency length on the development set. |
Experiments | Table 1 shows the accuracies of all parsers on the development set, in terms of labeled precision and recall over the predicate-argument dependencies in CCGBank. |
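One common way to realize an accuracy/speed trade-off like the beam-size tuning above is to pick the smallest beam whose dev accuracy is within a tolerance of the best observed, since smaller beams decode faster. This is a hypothetical sketch; the selection rule, the tolerance, and the numbers are all assumptions, not the authors' procedure.

```python
# Hypothetical sketch: prefer the smallest beam size whose development
# accuracy is within `tolerance` of the best observed (smaller = faster).

def pick_beam_size(results, tolerance=0.001):
    """results: {beam_size: dev_accuracy}."""
    best_acc = max(results.values())
    near_best = [b for b, acc in results.items() if acc >= best_acc - tolerance]
    return min(near_best)

# Illustrative numbers where accuracy saturates around beam 128:
dev_acc = {16: 0.902, 32: 0.908, 64: 0.911, 128: 0.9125, 256: 0.9126}
chosen_beam = pick_beam_size(dev_acc)
```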
Experiments | We used the NIST MT03 evaluation test data as our development set , and the NIST MT05 as the test set. |
Experiments | Table 4: Experiment results of the sense-based translation model (STM) with lexicon and sense features extracted from a window of size varying from ±5 to ±15 words on the development set. |
Experiments | Our first group of experiments was conducted to investigate the impact of the window size k on translation performance in terms of BLEU on the development set. |
Evaluation | We sampled data from the training and development set of the Persian dependency treebank (Rasooli et al., 2013) to create a comparable seventh dataset in Persian. |
Evaluation | ∞ is the upper-bound OOV reduction for our expansion model: for each word in the development set, we ask if our model, without any vocabulary size restriction at all, could generate it. |
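The oracle check described above can be sketched as a coverage count over the development vocabulary. The function name, the membership test, and the toy words are assumptions for illustration; only the "can the unrestricted model generate this dev word?" criterion comes from the sentence.

```python
# Hypothetical sketch: upper-bound OOV reduction as the fraction of
# development-set words the unrestricted expansion model could generate.

def oov_reduction_upper_bound(dev_words, can_generate):
    """can_generate(word) -> bool for the model with no vocabulary limit."""
    covered = sum(1 for w in dev_words if can_generate(w))
    return covered / len(dev_words)

# Toy "model" that can generate any word of length <= 6:
frac = oov_reduction_upper_bound(
    ["kitap", "kitaplar", "ev", "evde"],
    can_generate=lambda w: len(w) <= 6,
)
```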
Evaluation | Table 5: Results from running a handcrafted Turkish morphological analyzer (Oflazer, 1996) on different expansions and on the development set. |
Expected BLEU Training | 1We tuned λM+1 on the development set but found that λM+1 = 1 resulted in faster training and equal accuracy. |
Expected BLEU Training | We fix θ and re-optimize λ in the presence of the recurrent neural network model using Minimum Error Rate Training (Och, 2003) on the development set (§5). |
Experiments | either lattices or the unique 100-best output of the phrase-based decoder and reestimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models. |