Experiments | These languages are selected because they contain non-projective trees and are publicly available from the CoNLL-X webpage.6 Since the CoNLL-X data we have does not come with development sets, the last 10% of each training set is used for development.
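A minimal sketch of this positional split (no shuffling, since the last 10% of the training set is held out as-is); `split_train_dev` is a hypothetical helper name:

```python
def split_train_dev(sentences, dev_fraction=0.1):
    """Hold out the last dev_fraction of the training data as a development set."""
    cut = int(len(sentences) * (1 - dev_fraction))
    return sentences[:cut], sentences[cut:]

# e.g. a 1000-sentence training set yields a 100-sentence development set
train, dev = split_train_dev(list(range(1000)))
assert len(dev) == 100
```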
Experiments | Feature selection is done on the English development set.
Experiments | First, all parameters are tuned on the English development set by using grid search on T = [1, …].
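The grid values for T are truncated in the excerpt above, so the sketch below only shows the general grid-search pattern; `train_fn` and `eval_fn` are hypothetical stand-ins for the paper's training and development-set scoring routines:

```python
from itertools import product

def grid_search(param_grid, train_fn, eval_fn, train_data, dev_data):
    # Try every combination of parameter values; keep the one that
    # scores best on the development set.
    best_score, best_params = float("-inf"), None
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = eval_fn(train_fn(train_data, **params), dev_data)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```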
Selectional branching | Input: Dt: training set, Dd: development set.
Selectional branching | First, an initial model M0 is trained on all data by taking the one-best sequences, and its score is measured by testing on a development set (lines 2-4). |
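A minimal sketch of this initialization step, with `train_fn` and `eval_fn` as hypothetical stand-ins for the paper's one-best training and development-set scoring routines:

```python
def initialize(D_t, D_d, train_fn, eval_fn):
    # Train the initial model M0 on the one-best sequences of the
    # training set, then measure its score on the development set
    # (lines 2-4 of the paper's algorithm).
    M0 = train_fn(D_t)
    return M0, eval_fn(M0, D_d)
```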
Introduction | If there is a mismatch between the domain of the development set and the test set, domain adaptation can potentially harm performance compared to an unadapted baseline. |
Translation Model Architecture | As a way of optimizing instance weights, Sennrich (2012b) minimizes translation model perplexity on a set of phrase pairs automatically extracted from a parallel development set.
Translation Model Architecture | Cluster a development set into k clusters. |
Translation Model Architecture | 4.1 Clustering the Development Set |
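A hedged sketch of such a clustering step; representing development-set sentences as bag-of-words tf-idf vectors and using k-means is an assumption here, not necessarily the paper's exact setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_dev_set(dev_sentences, k):
    # Vectorize the development-set sentences and assign each one
    # to one of k clusters.
    vectors = TfidfVectorizer().fit_transform(dev_sentences)
    return KMeans(n_clusters=k, n_init=10).fit_predict(vectors)
```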
Experimental Setup | We optimized this value on the development set and obtained best results with 5 = 0. |
Results | We optimized the model parameters on a development set consisting of cue-associate pairs from Nelson et al. |
Results | The best performing model on the development set used 500 visual terms and 750 topics and the association measure proposed in Griffiths et al. |
The Attribute Dataset | For most concepts the development set contained a maximum of 100 images and the test set a maximum of 200 images. |
The Attribute Dataset | Concepts with fewer than 800 images in total were split into a test set and a development set of 1/8 each, and a training set of 3/4.
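A sketch of this split rule, combining the caps from the previous sentence with the fractions above; the assumption that whatever remains after the development and test caps goes to training is mine, and `split_concept` is a hypothetical helper name:

```python
def split_concept(images):
    n = len(images)
    if n < 800:                # small concepts: 1/8 test, 1/8 dev, 3/4 train
        n_test = n_dev = n // 8
    else:                      # large concepts: dev capped at 100, test at 200
        n_dev, n_test = 100, 200
    test = images[:n_test]
    dev = images[n_test:n_test + n_dev]
    train = images[n_test + n_dev:]
    return train, dev, test
```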
The Attribute Dataset | The development set was used for devising and refining our attribute annotation scheme. |
CRF and features | For this purpose the development set was split into a training part and a testing part.
Evaluation | The evaluation consisted of training the CRF on the whole development set annotated with the induced transformations, and then applying the trained model to tag the evaluation part with transformations.
Evaluation | Observation of the development set suggests that returning the original inflected NPs may be a better baseline. |
Preparation of training data | The whole set was divided randomly into the development set (1105 NPs) and evaluation set (564 NPs). |
Preparation of training data | The development set was enhanced with word-level transformations that were induced automatically in the following manner. |
Preparation of training data | The frequencies of all transformations induced from the development set are given in Tab. |
Related works | For development and evaluation, two subsets of NCP were chosen and manually annotated with NP lemmas: development set (112 phrases) and evaluation set (224 phrases). |
Experiments | Table 5: Experimental results on the English and Chinese development sets with the padding technique and new supervised features added incrementally. |
Experiments | Table 6: Experimental results on the English and Chinese development sets with different types of semi-supervised features added incrementally to the extended parser. |
Experiments | on the development sets.
Experiment | The development set and test set come from the NIST evaluation test data (from 2003 to 2005). |
Experiment | Finally, the development set includes 595 sentences from NIST MT03 and the test set contains 1,786 sentences from NIST MT04 and MT05. |
Experiment | We perform SRL on the source part of the training set, development set and test set by the Chinese SRL system used in (Zhuang and Zong, 2010b). |
Building the Resource | To this end, we constructed a development set comprising a sample of 1,000 derivational families induced using our rules.
Building the Resource | We also estimated the reliability of derivational rules by analyzing the accuracy of each rule on the development set . |
Evaluation | We have considered a number of string distance measures and tested them on the development set (cf. |
Evaluation | This is based on preliminary experiments on the development set (cf. |
Evaluation | Lemmas included in the development set (Section 4.1) were excluded from sampling. |
Experimental setup | We separated this corpus into three non-overlapping sets: a training set of 500 programs for parameter estimation in topic modeling and LE, a development set of 133 programs for empirical tuning, and a test set of 400 programs for performance evaluation.
Experimental setup | A number of parameters were set through empirical tuning on the development set.
Experimental setup | Figure 1 shows the results on the development set and the test set. |
Introduction | MERT (Och, 2003), MIRA (Watanabe et al., 2007; Chiang et al., 2008), PRO (Hopkins and May, 2011) and so on, which iteratively optimize a weight such that, after re-ranking a k-best list of a given development set with this weight, the loss of the resulting 1-best list is minimal.
Introduction | where f is a source sentence in a given development set, and ((e*, d*), (e', d')) is a preference pair for f; N is the number of all preference pairs; λ > 0 is a regularizer.
Introduction | Given a development set, we first run pre-training to obtain an initial parameter θ1 for Algorithm 1 in line 1.
Experiments | For comparison, we used the same test set with 40 newswire articles (672 sentences) as in (Ji and Grishman, 2008; Liao and Grishman, 2010) for the experiments, and randomly selected 30 other documents (863 sentences) from different genres as the development set.
Experiments | We use the harmonic mean of the trigger’s F1 measure and argument’s F1 measure to measure the performance on the development set . |
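As a worked helper, the harmonic mean of the two F1 scores:

```python
def harmonic_mean(trigger_f1, argument_f1):
    # 2ab / (a + b); defined as 0 when both scores are 0
    if trigger_f1 + argument_f1 == 0:
        return 0.0
    return 2 * trigger_f1 * argument_f1 / (trigger_f1 + argument_f1)

print(harmonic_mean(0.70, 0.50))  # 0.5833...
```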
Experiments | Figure 6 shows the training curves of the averaged perceptron with respect to the performance on the development set when the beam size is 4. |
Introduction | Adding the dependency language model ("depLM") and the maximum entropy shift-reduce parsing model ("maxent") significantly improves BLEU and TER on the development set, both separately and jointly.
Introduction | We used the 2002 NIST MT Chinese-English dataset as the development set and the 2003-2005 NIST datasets as the test sets.
Introduction | BLEU and TER scores are calculated on the development set . |
Experiments | Figure 3: Learning curve of the averaged perceptron classifier on the CTB development set.
Experiments | We train the baseline perceptron classifier for word segmentation on the training set of CTB 5.0, using the development set to determine the best number of training iterations.
Experiments | Figure 3 shows the learning curve of the averaged perceptron on the development set.
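A minimal sketch of picking the iteration count on the development set; `init_fn`, `epoch_fn`, and `eval_fn` are hypothetical stand-ins for the perceptron routines:

```python
def best_iteration(init_fn, epoch_fn, eval_fn, train_data, dev_data, max_iters=30):
    model, best_score, best_it = init_fn(), float("-inf"), 0
    for it in range(1, max_iters + 1):
        model = epoch_fn(model, train_data)   # one perceptron pass
        score = eval_fn(model, dev_data)
        if score > best_score:
            best_score, best_it = score, it
    return best_it   # number of iterations with the best dev accuracy
```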
Experiments | We used the development set for initial development and tuning hyperparameters. |
Experiments | For the GUSP system, we set the hyperparameters from initial experiments on the development set, and used them in all subsequent experiments.
Grounded Unsupervised Semantic Parsing | In preliminary experiments on the development set , we found that the naive model (with multinomials as conditional probabilities) did not perform well in EM. |
Experimentation | In addition, we reserve 33 documents from the training set as the development set and use the ground-truth entities, times, and values for our training and testing.
Experimentation | Our statistics on the development set show that almost 65% of the event mentions are involved in the Coreference, Parallel, and Sequence relations, which account for 63%, 50%, and 9%, respectively.6
Inferring Inter-Sentence Arguments on Relevant Event Mentions | development set; tri and tri′ are the triggers of the kth and k′th event mentions, whose event types are et and et′, in S<i,j> and S<i′,j′> respectively.
Experimental Assessment | We use sections 2-21 for training, section 22 as the development set, and section 23 as the test set.
Experimental Assessment | We train all parsers up to 30 iterations, and for each parser we select the weight vector β from the iteration with the best accuracy on the development set.
Experimental Assessment | We have computed the average value of γ on our English data set, resulting in 2.98 (variance 2.15) for the training set, and 2.95 (variance 1.96) for the development set.
Language Model Evaluation: Lexical Simplification | The data set contains a development set of 300 examples and a test set of 1710 examples.3 For our experiments, we evaluated the models on the test set. |
Language Model Evaluation: Lexical Simplification | The best lambda was chosen based on a linear search optimized on the SemEval 2012 development set . |
Why Does Unsimplified Data Help? | For the simplification task, the optimal lambda value determined on the development set was 0.98, with a very strong bias towards the simple model. |
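A sketch of the linear search behind this, assuming the two language models are interpolated as lam * p_simple(w) + (1 - lam) * p_normal(w) and scored by a hypothetical `dev_score_fn`:

```python
def search_lambda(p_simple, p_normal, dev_score_fn, step=0.01):
    def make_model(lam):
        return lambda w: lam * p_simple(w) + (1 - lam) * p_normal(w)
    candidates = [i * step for i in range(int(round(1 / step)) + 1)]
    # Keep the lambda whose interpolated model scores best on the
    # development set; in the experiment above the optimum was 0.98.
    return max(candidates, key=lambda lam: dev_score_fn(make_model(lam)))
```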
RSP: A Random Walk Model for SP | We split the test set equally into two parts: one as the development set and the other as the final test set. |
RSP: A Random Walk Model for SP | Parameter tuning: The parameters are tuned on the PTB development set, using AFP as the generalization data.
RSP: A Random Walk Model for SP | This experiment is conducted on the PTB development set with RND confounders. |
Experiment | We use the first 1,419 queries together with their annotated documents as the development set to tune paraphrasing parameters (as we discussed in Section 2.3), and use the rest as the test set. |
Experiment | The ranking model is trained based on the development set . |
Paraphrasing for Web Search | {(Q_i, D_i^label)}_{i=1}^N is a human-labeled development set.
Experiments | We randomly selected a development set and a test set, and the remaining sentence pairs are used as the training set.
Experiments | Furthermore, the development set and the test set are divided into various intervals according to their best fuzzy match scores.
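A sketch of this interval binning; the fixed interval width of 0.1 over [0, 1] is an assumption, as the excerpt does not give the actual boundaries:

```python
def bin_by_fuzzy_score(pairs_with_scores, width=0.1):
    # Group sentence pairs into intervals by best fuzzy match score,
    # e.g. [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0].
    n_bins = int(round(1 / width))
    bins = {}
    for pair, score in pairs_with_scores:
        lo = min(int(score / width), n_bins - 1) * width
        bins.setdefault(round(lo, 2), []).append(pair)
    return bins
```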
Experiments | All the feature weights and the weight for each probability factor (3 factors for Model-III) are tuned on the development set with minimum-error-rate training (MERT) (Och, 2003). |
Experiments | We set aside 132 documents as a development set and use 350 documents as the evaluation set. |
Experiments | We used L2-regularization; the regularization parameter was tuned using the development set.
Experiments | The parameter A was tuned using the development set . |
Experiment | The development sets are mainly used to tune the value of the weight factor α in Equation 5.
Experiment | We evaluated the performance (F-score) of our model on the three development sets by using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment | 1The "baseline" uses a different training configuration, so the α values used in decoding also need to be tuned on the development sets.
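A minimal sketch of this sweep; `f_score_fn` is a hypothetical helper that decodes with a given α and returns the F-score on a development set:

```python
def tune_alpha(f_score_fn, dev_set):
    candidates = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return max(candidates, key=lambda a: f_score_fn(a, dev_set))
```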
Experiments | We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set . |
Experiments | For the development set we use both ASR devsets 1 and 2 from IWSLT 2005, and
Experiments | For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003 |
Experiments | We tune the parameters on a small development set of 50 questions. |
Experiments | This development set is also extracted from Yahoo! |
Experiments | For parameter K, we do an experiment on the development set to determine the optimal value among 50, 100, 150, …, 300 in terms of MAP.
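A one-line sketch of this sweep; `map_fn` is a hypothetical helper that returns MAP on the development set for a given K:

```python
def tune_k(map_fn, dev_set):
    # K in {50, 100, 150, ..., 300}; keep the value with the highest MAP
    return max(range(50, 301, 50), key=lambda k: map_fn(k, dev_set))
```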
Experiments | We used as a development set ten additional documents from the Old Bailey proceedings and five additional documents from Trove that were not part of our test set. |
Results and Analysis | This slightly improves performance on our development set and can be thought of as placing a prior on the glyph shape parameters. |
Results and Analysis | We performed error analysis on our development set by randomly choosing 100 word errors from the WER alignment and manually annotating them with relevant features. |