Experiments | These languages are selected because they contain non-projective trees and are publicly available from the CoNLL-X webpage. Since the CoNLL-X data we have does not come with development sets, the last 10% of each training set is used for development.
Experiments | Feature selection is done on the English development set.
Experiments | First, all parameters are tuned on the English development set by using grid search on T = [1, .
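A minimal sketch of this kind of dev-set grid search, in the spirit of the tuning described above (not the authors' code): `train_fn` and `dev_score_fn` are hypothetical hooks that train a model with the given parameters and score it on the development set.

```python
# Hypothetical grid-search helper: exhaustively tries every parameter
# combination and keeps the one with the best development-set score.
from itertools import product

def grid_search(grids, train_fn, dev_score_fn):
    best_params, best_score = None, float("-inf")
    for values in product(*grids.values()):
        params = dict(zip(grids.keys(), values))  # one grid point
        model = train_fn(**params)                # train with these settings
        score = dev_score_fn(model)               # evaluate on the dev set
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```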
Selectional branching | Input: D_t: training set, D_d: development set.
Selectional branching | First, an initial model M0 is trained on all data by taking the one-best sequences, and its score is measured by testing on a development set (lines 2-4). |
Experiments of Grammar Formalism Conversion | Table 4: Results of the generative parser on the development set, when trained with various weightings of the CTB training set and CDTPS.
Experiments of Parsing | We used a standard split of CTB for performance evaluation: articles 1-270 and 400-1151 as the training set, articles 301-325 as the development set, and articles 271-300 as the test set.
Experiments of Parsing | We tried the corpus weighting method when combining CDTPS with the CTB training set (abbreviated as CTB for simplicity) as training data, gradually increasing the weight (1, 2, 5, 10, 20, or 50) of CTB to optimize parsing performance on the development set.
Experiments of Parsing | Table 4 presents the results of the generative parser with various weights of CTB on the development set.
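The corpus-weighting experiment above amounts to replicating the in-domain treebank and retraining at each weight; a sketch under that reading follows, with `train` and `dev_score` as hypothetical hooks and the weight grid taken from the text.

```python
# Hypothetical corpus-weighting loop: replicate CTB w times, concatenate
# with the converted CDTPS data, and keep the weight with the best
# development-set score.
def tune_corpus_weight(ctb, cdtps, train, dev_score,
                       weights=(1, 2, 5, 10, 20, 50)):
    best_w, best_s = None, float("-inf")
    for w in weights:
        combined = ctb * w + cdtps   # lists of trees; CTB duplicated w times
        s = dev_score(train(combined))
        if s > best_s:
            best_w, best_s = w, s
    return best_w
```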
Our Two-Step Solution | The number of removed trees will be determined by cross-validation on the development set.
Our Two-Step Solution | The value of λ will be tuned by cross-validation on the development set.
Our Two-Step Solution | Corpus weighting, with the weight tuned on the development set, is exactly such an approach, and it is the one we use for parsing homogeneous treebanks in this paper.
Introduction | If there is a mismatch between the domain of the development set and the test set, domain adaptation can potentially harm performance compared to an unadapted baseline. |
Translation Model Architecture | As a way of optimizing instance weights, Sennrich (2012b) minimizes translation model perplexity on a set of phrase pairs automatically extracted from a parallel development set.
Translation Model Architecture | Cluster a development set into k clusters. |
Translation Model Architecture | 4.1 Clustering the Development Set |
Related Work | The training and development sets were provided in full to task participants.
Related Work | However, we were unable to download all the training and development sets because some tweets were deleted or not available due to modified authorization status. |
Related Work | The tradeoff parameter of ReEmb (Labutov and Lipson, 2013) is tuned on the development set of SemEval 2013. |
Error Detection with a Maximum Entropy Model | To avoid overfitting, we optimize the Gaussian prior on the development set . |
Experiments | We find that the LG parser cannot fully parse 560 sentences (63.8%) in the training set (MT-02), 731 sentences (67.6%) in the development set (MT-05), and 660 sentences (71.8%) in the test set (MT-03).
Experiments | To compare with previous work using word posterior probabilities for confidence estimation, we carried out experiments using wpp estimated from N-best lists with the classification threshold τ, which was optimized on our development set to minimize CER.
Features | This does not mean we do not need a development set.
Features | We do validate our feature selection and other experimental settings on the development set.
Features | We optimize the discrete factor on our development set and find that the optimal value is 1.
Introduction | 3) Divide words into two groups (correct translations and errors) by using a classification threshold optimized on a development set . |
Related Work | We directly output prediction results from our discriminatively trained classifier without optimizing a classification threshold on a distinct development set beforehand. Most previous approaches make decisions based on a pre-tuned classification threshold τ as follows
SMT System | For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task. |
Experiments | Note that most previous work does not report (or need) a standard development set; hence, for tuning our features and their hyper-parameters, we randomly split the original training data into a training and development set with a 70/30 ratio (and then use the full original training set during testing).
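A seeded random 70/30 split like the one described above can be sketched as follows (the seed and function name are ours, not the paper's).

```python
# Hypothetical helper for a reproducible random train/dev split.
import random

def split_train_dev(examples, dev_ratio=0.3, seed=0):
    items = list(examples)
    random.Random(seed).shuffle(items)       # fixed seed for reproducibility
    cut = int(len(items) * (1 - dev_ratio))  # 70% train, 30% dev
    return items[:cut], items[cut:]
```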
Experiments | Note that the development set is used only for ACE04, because for ACE05 and ACE05-ALL we directly test using the features tuned on ACE04.
Experiments | Table 2: Incremental results for the Web features on the ACE04 development set . |
Semantics via Web Features | As a real example from our development set, the co-occurrence count c12 for the headword pair (leader, president) is 11383, while it is only 95 for the headword pair (voter, president); after normalization and log10, the values are -10.9 and -12.0, respectively.
Semantics via Web Features | Also, we do not constrain the order of h1 and h2, because these patterns can hold for either direction of coreference. As a real example from our development set, the pattern co-occurrence count for the headword pair (leader, president) is 752, while for (voter, president) it is 0.
Semantics via Web Features | We chose the following three context types, based on performance on a development set: |
Experimental Evaluation | The graphs were randomly split into a development set (11 graphs) and a test set (12 graphs).
Experimental Evaluation | The left half depicts methods where the development set was needed to tune parameters, and the right half depicts methods that do not require a (manually created) development set at all. |
Experimental Evaluation | Results on the left were achieved by optimizing the top-K parameter on the development set , and on the right by optimizing on the training set automatically generated from WordNet. |
Learning Entailment Graph Edges | Note that this constant needs to be optimized on a development set . |
Learning Entailment Graph Edges | Importantly, while the score-based formulation contains a parameter λ that requires optimization, this probabilistic formulation is parameter-free and does not utilize a development set at all.
Discussion | In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set . |
Experiments | Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters. |
Experiments | MERT is then performed to optimize the BLEU score on a development set; for MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
Experiments | We select the MBR scaling factor (Tromble et al., 2008) based on the development set; it is set to 0.1, 0.01, 0.5, 0.2, 0.5 and 1.0 for the aren-phrase, aren-hier, aren-samt, zhen-phrase, zhen-hier and zhen-samt systems respectively.
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set . |
Cluster Feature Selection | Our main idea is to learn the best set of prefix lengths, perhaps by validating their effectiveness on a development set.
Cluster Feature Selection | Because this method does not need validation on the development set , it is the laziest but the fastest method for selecting clusters. |
Cluster Feature Selection | Exhaustive Search (ES): ES works by trying every possible combination of the set I and picking the one that works the best for the development set . |
Experiments | The set of prefix lengths that worked the best for the development set was chosen to select clusters. |
Experiments | Interestingly, ES did not always outperform the two statistical methods, which may be due to overfitting to the development set.
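Exhaustive Search as described in the Cluster Feature Selection section enumerates every non-empty subset of candidate prefix lengths and keeps the subset that scores best on the development set; a sketch, with `dev_score` a hypothetical hook that trains and evaluates with the given prefixes:

```python
# Hypothetical exhaustive subset search over cluster prefix lengths.
from itertools import combinations

def exhaustive_prefix_search(candidate_lengths, dev_score):
    best_subset, best = None, float("-inf")
    for k in range(1, len(candidate_lengths) + 1):
        for subset in combinations(candidate_lengths, k):
            s = dev_score(subset)          # development-set performance
            if s > best:
                best_subset, best = subset, s
    return best_subset, best
```

Note that the winning subset is exactly the quantity that can overfit the development set, consistent with the observation above that ES does not always win on test data.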
Experiments | For the semi-supervised system, each test fold was the same one used in the baseline and the other 4 folds were further split into a training set and a development set in a ratio of 7:3 for selecting clusters. |
Beam-Width Prediction | Figure 3 is a visual representation of beam-width prediction on a single sentence of the development set using the Berkeley latent-variable grammar and Boundary FOM. |
Conclusion and Future Work | Table 1: Section 22 development set results for CYK and Beam-Search (Beam) parsing using the Berkeley latent-variable grammar. |
Open/Closed Cell Classification | We found the value λ = 10² to give the best performance on our development set, and we use this value in all of our experiments.
Open/Closed Cell Classification | Figures 2a and 2b compare the pruned charts of Chart Constraints and Constituent Closure for a single sentence in the development set . |
Open/Closed Cell Classification | We compare the effectiveness of Constituent Closure, Complete Closure, and Chart Constraints by decreasing the percentage of chart cells closed until accuracy over all sentences in our development set starts to decline.
Results | In Table 1 we present the accuracy and parse time for three baseline parsers on the development set: exhaustive CYK parsing, beam-search parsing using only the inside score, and beam-search parsing using the Boundary FOM.
Supervised evaluation tasks | training partition sentences, and evaluated their F1 on the development set . |
Supervised evaluation tasks | After each epoch over the training set, we measured the accuracy of the model on the development set . |
Supervised evaluation tasks | Training was stopped after the accuracy on the development set did not improve for 10 epochs, generally about 50-80 epochs total.
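The stopping rule above is standard early stopping with a patience of 10 epochs; a sketch with hypothetical `train_epoch` and `dev_accuracy` callbacks:

```python
# Hypothetical early-stopping loop: halt once dev accuracy has not
# improved for `patience` consecutive epochs.
def train_with_early_stopping(train_epoch, dev_accuracy,
                              patience=10, max_epochs=200):
    best_acc, best_epoch, stale = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        acc = dev_accuracy()
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc
```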
CRF and features | For this purpose the development set was split into training and testing parts.
Evaluation | Evaluation consisted of training the CRF on the whole development set annotated with the induced transformations, and then applying the trained model to tag the evaluation part with transformations.
Evaluation | Observation of the development set suggests that returning the original inflected NPs may be a better baseline. |
Preparation of training data | The whole set was divided randomly into the development set (1105 NPs) and evaluation set (564 NPs). |
Preparation of training data | The development set was enhanced with word-level transformations that were induced automatically in the following manner. |
Preparation of training data | The frequencies of all transformations induced from the development set are given in Tab. |
Related works | For development and evaluation, two subsets of NCP were chosen and manually annotated with NP lemmas: development set (112 phrases) and evaluation set (224 phrases). |
Experimental Setup | We optimized this value on the development set and obtained best results with 5 = 0. |
Results | We optimized the model parameters on a development set consisting of cue-associate pairs from Nelson et al. |
Results | The best performing model on the development set used 500 visual terms and 750 topics and the association measure proposed in Griffiths et al. |
The Attribute Dataset | For most concepts the development set contained a maximum of 100 images and the test set a maximum of 200 images. |
The Attribute Dataset | Concepts with fewer than 800 images in total were split into 1/8 test set, 1/8 development set, and 3/4 training set.
The Attribute Dataset | The development set was used for devising and refining our attribute annotation scheme. |
Syllabification Experiments | Recall that 5K training examples were held out as a development set . |
Syllabification with Structured SVMs | We use a linear kernel, and tune the SVM’s cost parameter on a development set . |
Syllabification with Structured SVMs | While developing our tagging schemes and feature representation, we used a development set of 5K words held out from our CELEX training data. |
Syllabification with Structured SVMs | Our development set experiments suggested that numbering ONC tags increases their performance. |
Experiments | For this experiment, all models were estimated from the training set and evaluated on the development set . |
Experiments | For test set evaluations, we trained on the combination of the training and development sets (§2), to maximize the amount of training data for the final experiments. |
Experiments | We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set.
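Read literally, the footnote above describes a scan over 1001 evenly spaced thresholds in [1, 4]; a sketch under that reading, with prediction scores and binarized gold labels as plain lists:

```python
# Hypothetical threshold scan: pick the cut-off in [1, 4] that maximizes
# accuracy of binarized predictions on the (binarized) development set.
def select_threshold(pred_scores, gold_binary, lo=1.0, hi=4.0, points=1001):
    best_t, best_acc = lo, -1.0
    for i in range(points):
        t = lo + (hi - lo) * i / (points - 1)
        acc = sum((s >= t) == g
                  for s, g in zip(pred_scores, gold_binary)) / len(gold_binary)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```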
System Description | On 10 preliminary runs with the development set , this variance |
System Description | Table 1: Pearson's r on the development set, for our full system and variations excluding each feature type.
Experimental Setup | For efficiency, we limit the sentence length to 70 tokens in training and development sets . |
Experimental Setup | This gives a 99% pruning recall on the CATiB development set . |
Experimental Setup | After pruning, we tune the regularization parameter over {0.1, 0.01, 0.001} on the development sets for the different languages.
Sampling-Based Dependency Parsing with Global Features | In our work we choose α = 0.003, which gives a 98.9% oracle POS tagging accuracy on the CATiB development set.
Experiment | We conducted experiments on the Penn Chinese Treebank (CTB) version 5.1 (Xue et al., 2005): Articles 001-270 and 400-1151 were used as the training set, Articles 301-325 were used as the development set, and Articles 271-300 were used as the test set.
Experiment | We tuned the optimal number of iterations of perceptron training algorithm on the development set . |
Experiment | We trained these three systems on the training set and evaluated them on the development set . |
Experiments on Parsing-Based SMT | The feature weights are tuned on the development set using minimum error rate training.
Experiments on Phrase-Based SMT | We used the NIST MT-2002 set as the development set and the NIST MT-2004 test set as the test set. |
Experiments on Phrase-Based SMT | Koehn's implementation of minimum error rate training (Och, 2003) is used to tune the feature weights on the development set.
Experiments on Word Alignment | For Eq. (11), we also manually labeled a development set of 100 sentence pairs, in the same manner as the test set.
Experiments on Word Alignment | By minimizing the AER on the development set, the interpolation coefficients of the collocation probabilities on CM-1 and CM-2 were set to 0.1 and 0.9.
Improving Phrase Table | For phrases containing only one word, we set a fixed collocation probability that is the average of the collocation probabilities of the sentences on a development set.
Experimental Results 5.1 Data Resources | This book contains eleven practice tests, and we used all the sentence completion questions in the first five tests as a development set , and all the questions in the last six tests as the test set. |
Experimental Results 5.1 Data Resources | To provide human benchmark performance, we asked six native speaking high school students and five graduate students to answer the questions on the development set . |
Experimental Results 5.1 Data Resources | For the LSA-LM, an interpolation weight of 0.1 was used for the LSA score, determined through optimization on the development set.
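The interpolation above can be sketched as a convex combination of the two scores, with the 0.1 weight from the text; whether the combination is applied in log space is our assumption, and the scoring hooks are hypothetical.

```python
# Hypothetical interpolated scorer for sentence-completion candidates:
# mixes an LSA score with an n-gram LM score using a dev-tuned weight.
def best_completion(candidates, lsa_score, lm_score, lsa_weight=0.1):
    return max(candidates,
               key=lambda c: lsa_weight * lsa_score(c)
                             + (1.0 - lsa_weight) * lm_score(c))
```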
Sentence Completion via Latent Semantic Analysis | In practice, a value of 1.2 was selected on the basis of development set results.
Experimental Setup | In our experiments, the development set contains 200 sentences and the test set contains 500 sentences, both of which are randomly selected from the human translations of 2008 NIST Open Machine Translation Evaluation: Chinese to English Task. |
Statistical Paraphrase Generation | = c_dev(+r) / c_dev(r), where c_dev(r) is the total number of unit replacements in the generated paraphrases on the development set.
Statistical Paraphrase Generation | Replacement rate (rr): rr measures the paraphrase degree on the development set, i.e., the percentage of words that are paraphrased.
Statistical Paraphrase Generation | We define rr as: rr = w_dev(r) / w_dev(s), where w_dev(r) is the total number of words in the replaced units on the development set, and w_dev(s) is the number of words of all sentences on the development set.
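Both development-set statistics above reduce to simple counting; a sketch, assuming each replacement record carries a word count and a correctness flag (that data layout is ours, not the paper's):

```python
# Hypothetical computation of precision p = c_dev(+r)/c_dev(r) and
# replacement rate rr = w_dev(r)/w_dev(s) over the development set.
def dev_statistics(replacements, total_sentence_words):
    c_r = len(replacements)                                # c_dev(r)
    c_plus = sum(1 for r in replacements if r["correct"])  # c_dev(+r)
    w_r = sum(r["num_words"] for r in replacements)        # w_dev(r)
    p = c_plus / c_r if c_r else 0.0
    rr = w_r / total_sentence_words                        # w_dev(s) in denom
    return p, rr
```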
Introduction | MERT (Och, 2003), MIRA (Watanabe et al., 2007; Chiang et al., 2008), PRO (Hopkins and May, 2011) and so on, which iteratively optimize a weight such that, after re-ranking a k-best list of a given development set with this weight, the loss of the resulting 1-best list is minimal.
Introduction | where f is a source sentence in a given development set, and ((e*, d*), (e', d')) is a preference pair for f; N is the number of all preference pairs; λ > 0 is a regularizer.
Introduction | Given a development set, we first run pre-training to obtain an initial parameter θ1 for Algorithm 1 in line 1.
Experimental setup | We separated this corpus into three non-overlapping sets: a training set of 500 programs for parameter estimation in topic modeling and LE, a development set of 133 programs for empirical tuning and a test set of 400 programs for performance evaluation. |
Experimental setup | A number of parameters were set through empirical tuning on the development set.
Experimental setup | Figure 1 shows the results on the development set and the test set. |
Dependency-based Pre-ordering Rule Set | 4. Conduct primary experiments, which used the same training set and development set as the experiments described in Section 3.
Dependency-based Pre-ordering Rule Set | In the primary experiments, we tested the effectiveness of the candidate rules and filtered the ones that did not work based on the BLEU scores on the development set . |
Experiments | Our development set was the official NIST MT evaluation data from 2002 to 2005, consisting of 4476 Chinese-English sentences pairs. |
Experiments | selected from the development set.
Experiments | The evaluation set contained 200 sentences randomly selected from the development set . |
Experiments | For comparison, we used the same test set with 40 newswire articles (672 sentences) as in (Ji and Grishman, 2008; Liao and Grishman, 2010) for the experiments, and randomly selected 30 other documents (863 sentences) from different genres as the development set.
Experiments | We use the harmonic mean of the trigger’s F1 measure and argument’s F1 measure to measure the performance on the development set . |
Experiments | Figure 6 shows the training curves of the averaged perceptron with respect to the performance on the development set when the beam size is 4. |
Building the Resource | To this end, we constructed a development set comprised of a sample of 1,000 derivational families induced using our rules. |
Building the Resource | We also estimated the reliability of derivational rules by analyzing the accuracy of each rule on the development set . |
Evaluation | We have considered a number of string distance measures and tested them on the development set (cf. |
Evaluation | This is based on preliminary experiments on the development set (cf. |
Evaluation | Lemmas included in the development set (Section 4.1) were excluded from sampling. |
Results | the English development set as a function of number of training iterations with two different beam sizes, 20 and 100, over the local and nonlocal feature sets. |
Results | In Figure 4 we compare early update with LaSO and delayed LaSO on the English development set . |
Results | Table 1 displays the differences in F-measures and CoNLL average between the local and nonlocal systems when applied to the development sets for each language. |
Wambaya grammar | In addition, this grammar has relatively low ambiguity, assigning on average 11.89 parses per item in the development set . |
Wambaya grammar | Of the 92 sentences in this text, 20 overlapped with items in the development set , so the |
Wambaya grammar | The parsed portion of the development set (732 items) constitutes a sufficiently large corpus to train a parse selection model using the Redwoods disambiguation technology (Toutanova et al., 2005). |
Abstract | Following most previous work, e.g. (Collins, 2002) and (Shen et al., 2007), we divide this corpus into training set (sections 0-18), development set (sections 19-21) and the final test set (sections 22-24).
Abstract | Following (Jiang et al., 2008a), we divide this corpus into training set (chapters 1-260), development set (chapters 271-300) and the final test set (chapters 301-325).
Abstract | Experiments in this section are carried out on the development set . |
Error Analysis | We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set ; see Table 4. |
Error Analysis | Table 4: Counts of error categories (out of 100 randomly sampled ARZ development set errors). |
Error Analysis | One example of this distinction that appeared in the development set is a pair of alternative segmentations of the word meaning "my topic".
Experiments | F1 scores provide a more informative assessment of performance than word-level or character-level accuracy scores, as over 80% of tokens in the development sets consist of only one segment, with an average of one segmentation every 4.7 tokens (or one every 20.4 characters). |
Experiments | Table 1 contains results on the development set for the model of Green and DeNero and our improvements. |
Experiments | The regularization parameter A is tuned on the development set . |
Experiments | We run all three algorithms for multiple epochs and pick the best epoch based on development set performance. |
Experiments | For the first set of experiments, we use the same division of the corpus as in (Livescu and Glass, 2004; Jyothi et al., 2011) into a 2492—word training set, a 165-word development set , and a 236-word test set. |
Regularization Improves Topic Models | We split each dataset into a training fold (70%), development fold (15%), and a test fold (15%): the training data are used to fit models; the development set is used to select parameters (anchor threshold M, document prior α, regularization weight λ); and final results are reported on the test fold.
Regularization Improves Topic Models | We select α using grid search on the development set.
Regularization Improves Topic Models | 4.1 Grid Search for Parameters on Development Set |
Datasets | In other words, the development set includes documents from July 1995, July 1996, July 1997, etc. |
Datasets | The development set includes 7,300 documents, drawn from July of each year.
Experiments and Results | The λ factor in the joint classifier is optimized on the development set as described in Section 4.3.
Experiments and Results | The features described in this paper were selected solely by studying performance on the development set . |
Learning Time Constraints | Figure 4: Development set accuracy and λ values.
Empirical evaluation | We tuned the L1 regularization strength, developed features, and ran analysis experiments on the development set (averaging across random splits). |
Empirical evaluation | To further examine this, we ran BCFL13 on the development set , allowing it to use only predicates from logical forms suggested by our logical form construction step. |
Empirical evaluation | This improved oracle accuracy on the development set to 64.5%, but accuracy was 32.2%. |
Experiments | Table 5: Experimental results on the English and Chinese development sets with the padding technique and new supervised features added incrementally. |
Experiments | Table 6: Experimental results on the English and Chinese development sets with different types of semi-supervised features added incrementally to the extended parser. |
Experiments | on the development sets . |
Experiment | The development set and test set come from the NIST evaluation test data (from 2003 to 2005). |
Experiment | Finally, the development set includes 595 sentences from NIST MT03 and the test set contains 1,786 sentences from NIST MT04 and MT05. |
Experiment | We perform SRL on the source part of the training set, development set and test set by the Chinese SRL system used in (Zhuang and Zong, 2010b). |
Experimental Results and Analysis | We used 10 newswire texts from the ACE 2005 training corpora (from March to May of 2003) as our development set, and then conducted a blind test on a separate set of 40 ACE 2005 newswire texts.
Experimental Results and Analysis | We select the thresholds (dk with k = 1~13) for the various confidence metrics by optimizing the F-measure score of each rule on the development set, as shown in Figures 2 and 3.
Experimental Results and Analysis | The labeled point on each curve shows the best F-measure that can be obtained on the development set by adjusting the threshold for that rule. |
Experiments | To empirically investigate the parameter λ and the convergence of our algorithm aPLSA, we generated five more data sets as development sets.
Experiments | A detailed description of these five development sets, namely tune1 to tune5, is listed in Table 1 as well.
Experiments | We have run experiments on the development sets to investigate how different values of λ affect the performance of aPLSA.
Experiments | We randomly chose 9 documents from the year 2001 for a development set , and 41 documents for testing. |
How Frequent is Unseen Data? | We then record every seen (vd, n) pair during training that is seen two or more times and then count the number of unseen pairs in the NYT development set (1455 tests).
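The measurement above can be sketched as follows: build the set of training pairs seen at least twice, then count the development-set pairs outside it (the pair representation is our assumption).

```python
# Hypothetical unseen-pair rate: fraction of dev pairs not seen at least
# `min_count` times in training.
from collections import Counter

def unseen_rate(train_pairs, dev_pairs, min_count=2):
    counts = Counter(train_pairs)
    seen = {p for p, c in counts.items() if c >= min_count}
    return sum(1 for p in dev_pairs if p not in seen) / len(dev_pairs)
```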
How Frequent is Unseen Data? | Figure 1: Percentage of NYT development set that is unseen when trained on varying amounts of data. |
How Frequent is Unseen Data? | Figure 2: Percentage of subject/object/preposition arguments in the NYT development set that is unseen when trained on varying amounts of NYT data. |
Evaluation | A comparison of the impact of accounting for all derivations in training and decoding (development set).
Evaluation | The effect of the beam width (log-scale) on max-translation decoding (development set).
Evaluation | An informal comparison of the outputs on the development set , presented in Table 4, suggests that the |
Citation Extraction Data | There are 660 citations in the development set and 367 citations in the test set.
Citation Extraction Data | We then use the development set to learn the penalties for the soft constraints, using the perceptron algorithm described in section 3.1. |
Citation Extraction Data | We instantiate constraints from each template in section 5.1, iterating over all possible labels that contain a B prefix at any level in the hierarchy and pruning all constraints with imp(c) < 2.75 calculated on the development set . |
Soft Constraints in Dual Decomposition | We found it beneficial, though it is not theoretically necessary, to learn the constraints on a held-out development set , separately from the other model parameters, as during training most constraints are satisfied due to overfitting, which leads to an underestimation of the relevant penalties. |
Experiments | The data splitting convention of the People's Daily corpus does not reserve a development set, so in the following experiments we simply choose the model after 7 iterations when training on this corpus.
Experiments | Table 4: Error analysis for Joint S&T on the development set of CTB.
Experiments | To obtain further information about what kinds of errors are alleviated by annotation adaptation, we conduct an initial error analysis for Joint S&T on the development set of CTB.
Experiments | Note that the development set was only used for evaluating the trained model to obtain the optimal values of tunable parameters. |
Experiments | For the baseline policy, we varied r in the range [1, 5] and found that setting r = 3 yielded the best performance on the development set for both the small and large training corpus experiments.
Experiments | Optimal balances were selected using the development set . |
Policies for correct path selection | In our experiments, the optimal threshold value r is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
Phrase-based machine translation | For each alignment model and decoding type we train Moses and use MERT optimization to tune its parameters on a development set . |
Phrase-based machine translation | For Hansards we randomly chose 1000 and 500 sentences from test 1 and test 2 to be testing and development sets respectively. |
Phrase-based machine translation | In principle, we would like to tune the threshold by optimizing BLEU score on a development set , but that is impractical for experiments with many pairs of languages. |
Word alignment results | Figure 7 shows an example of the same sentence, using the same model, where in one case Viterbi decoding was used and in the other case posterior decoding tuned to minimize AER on a development set.
Experiments | For Chinese-English-Spanish translation, we used the development set (devset3) released for the pivot task as the test set, which contains 506 source sentences, with 7 reference translations in English and Spanish. |
Experiments | To be capable of tuning parameters on our systems, we created a development set of 1,000 sentences taken from the training sets, with 3 reference translations in both English and Spanish. |
Experiments | This development set is also used to train the regression learning model. |
Introduction | Adding dependency language model (“depLM”) and the maximum entropy shift-reduce parsing model (“maxent”) significantly improves BLEU and TER on the development set , both separately and jointly. |
Introduction | We used the 2002 NIST MT Chinese-English dataset as the development set and the 2003-2005 NIST datasets as the testsets. |
Introduction | BLEU and TER scores are calculated on the development set . |
Annotations | Table 2: Results for the Penn Treebank development set, sentences of length ≤ 40, for different annotation schemes implemented on top of the X-bar grammar.
Features | Table 1 shows the results of incrementally building up our feature set on the Penn Treebank development set . |
Other Languages | (2013) only report results on the development set for the Berkeley-Rep model; however, the task organizers also use a version of the Berkeley parser provided with parts of speech from high-quality POS taggers for each language (Berkeley-Tags). |
Other Languages | On the development set , we outperform the Berkeley parser and match the performance of the Berkeley-Rep parser. |
BBC News Database | the kernel whose value is optimized on the development set . |
BBC News Database | where α is a smoothing parameter tuned on the development set, s_a is the annotation for the latent variable s, and s_d its corresponding document.
BBC News Database | where μ is a smoothing parameter estimated on the development set, 1(w, s_a) is a Boolean variable denoting whether w appears in the annotation s_a, and N_w is the number of latent variables that contain w in their annotations.
Experiments | Figure 3: Learning curve of the averaged perceptron classifier on the CTB development set.
Experiments | We train the baseline perceptron classifier for word segmentation on the training set of CTB 5.0, using the development set to determine the best number of training iterations.
Experiments | Figure 3 shows the learning curve of the averaged perceptron on the development set.
Experiments | While emails and weblogs are used as the development sets , reviews, news groups and Yahoo!Answers are used as the final test sets. |
Experiments | All these parameters are selected according to the averaged accuracy on the development set . |
Experiments | Experimental results under the 4 combined settings on the development sets are illustrated in Figures 2, 3 and 4, where the
Boosting an MST Parser | The relative weight λ is adjusted to maximize performance on the development set, using an algorithm similar to minimum error-rate training (Och, 2003).
Experiments | Figure 3: Performance curves of the word-pair classification model on the development sets of WSJ and CTB 5.0, with respect to a series of ratios r.
Experiments | Figure 4: The performance curve of the word-pair classification model on the development set of CTB 5.0, with respect to a series of thresholds θ.
Experiments | Then, on each instance set we train a classifier and test it on the development set of CTB 5.0.
Background | The data set used for weight training is generally called the development set or tuning set in the SMT field.
Background | We see, first of all, that all the three systems are improved during iterations on the development set . |
Background | Figure 2: BLEU scores on the development set as a function of iteration number.
Abstract | We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.
Experiments | The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets — BLEU result differences between tuning 12 default features |
Experiments | However, scaling all features to the full training set shows significant improvements for algorithm 3, and especially for algorithm 4, which gains 0.8 BLEU points over tuning 12 features on the development set . |
Introduction | Our resulting models are learned on large data sets, but they are small and outperform models that tune feature sets of various sizes on small development sets . |
Experiments | Figure 4 shows the UAS curves on the development set, where K is the beam size for Intersect and the k-best size for Rescoring, the x-axis represents K, and the y-axis represents the UAS scores.
Experiments | Table 3: The parsing times on the development set (seconds for all the sentences) |
Experiments | Table 3 shows the parsing times of Intersect on the development set for English. |
Implementation Details | The numbers, 10% and 30%, are tuned on the development sets in the experiments. |
Introduction | There are 216 documents and 4126 original-permutation pairs in the training set, and 24 documents and 465 pairs in the development set . |
Introduction | Transition length, salience, and a regularization parameter are tuned on the development set . |
Introduction | We only report results using the setting of transition length g 4, and no salience threshold, because they give the best performance on the development set . |
Corpora and Parameters | We used part of the Russian corpus as a development set for determining the parameters. |
Corpora and Parameters | On our development set we have tested various parameter settings. |
Corpora and Parameters | In our experiments we have used the following values (again, determined using a development set) for these parameters: F0: 1,000 words per million (wpm); FH: 100 wpm; FB: 1.2 wpm; N: 500 words; W: 5 words; L: 30%; S: 2/3; α: 0.1.
Argument Mapping Model | Given these features with gold standard parses, our argument mapping model can predict entire argument mappings with an accuracy rate of 87.96% on the test set, and 87.70% on the development set . |
Identification and Labeling Models | All classifiers were trained to 500 iterations of L-BFGS training, a quasi-Newton method from the numerical optimization literature (Liu and Nocedal, 1989), using Zhang Le's maxent toolkit. To prevent overfitting we used Gaussian priors with global variances of 1 and 5 for the identifier and labeler, respectively. The Gaussian priors were determined empirically by testing on the development set.
Identification and Labeling Models | The size of the window was determined experimentally on the development set; we use the same window sizes throughout.
Experiments | We tune the parameters on a small development set of 50 questions. |
Experiments | This development set is also extracted from Yahoo! |
Experiments | For parameter K, we run an experiment on the development set to determine the optimal value among 50, 100, 150, ..., 300 in terms of MAP.
Experiments | The English experiments were performed on the Penn Treebank (Marcus et al., 1993), using a standard set of head-selection rules (Yamada and Matsumoto, 2003) to convert the phrase structure syntax of the Treebank to a dependency tree representation. We split the Treebank into a training set (Sections 2-21), a development set (Section 22), and several test sets (Sections 0, 1, 23, and 24).
Experiments | For the number of iterations of perceptron training, we performed up to 30 iterations and chose the iteration which optimized accuracy on the development set.
Experiments | Table 6: Parent-prediction accuracies of unlabeled Czech parsers on the PDT 1.0 development set . |
Co-training strategy for prosodic event detection | Development set: 20 utterances (1,356 words, 2,275 syllables), speakers f2b, f3b; Labeled set L: 5 utterances (347 words, 573 syllables), speakers m2b, m3b; Unlabeled set U: 1,027 utterances (77,207 words, 129,305 syllables), speaker m4b.
Conclusions | In our experiment, we used some labeled data as a development set to estimate some parameters.
Experiments and results | Among the labeled data, 102 utterances of the f1a and m1b speakers are used for testing, 20 utterances randomly chosen from f2b, f3b, m2b, m3b, and m4b are used as the development set to optimize parameters such as λ and the confidence level threshold, 5 utterances are used as the initial training set L, and the rest of the data is used as the unlabeled set U, which has 1027 unlabeled utterances (we removed the human labels for the co-training experiments).
Expected BLEU Training | We tuned λM+1 on the development set but found that λM+1 = 1 resulted in faster training and equal accuracy.
Expected BLEU Training | We fix θ and re-optimize λ in the presence of the recurrent neural network model using Minimum Error Rate Training (Och, 2003) on the development set (§5).
Experiments | We use either lattices or the unique 100-best output of the phrase-based decoder and re-estimate the log-linear weights by running a further iteration of MERT on the n-best list of the development set, augmented by scores corresponding to the neural network models.
Experiments | The beam size was tuned on the development set , and a value of 128 was found to achieve a reasonable balance of accuracy and speed; hence this value was used for all experiments. |
Experiments | dependency length on the development set . |
Experiments | Table 1 shows the accuracies of all parsers on the development set , in terms of labeled precision and recall over the predicate-argument dependencies in CCGBank. |
Experiments | We used the NIST MT03 evaluation test data as our development set , and the NIST MT05 as the test set. |
Experiments | Table 4: Experiment results of the sense-based translation model (STM) with lexicon and sense features extracted from a window of size varying from ±5 to ±15 words on the development set.
Experiments | Our first group of experiments was conducted to investigate the impact of the window size k on translation performance, in terms of BLEU on the development set.
Experiments | We use the standard split of the Treebank: sections 02-21 as the training data (39832 sentences), section 22 as the development set (1700 sentences), and section 23 as the test set (2416 sentences). |
Experiments | The development set and the test set are parsed with a model trained on all 39832 training sentences. |
Experiments | We use the development set to determine the optimal number of iterations for averaged perceptron, and report the F1 score on the test set. |
Experimental Setup | Training Our model needs to be provided with the number of clusters K. We set K large enough for the model to learn effectively on the development set . |
Experimental Setup | A threshold for this proportion is set for each property via the development set . |
Model Description | Properties with proportions above a set threshold (tuned on a development set ) are predicted as being supported. |
Evaluation | We sampled data from the training and development set of the Persian dependency treebank (Rasooli et al., 2013) to create a comparable seventh dataset in Persian. |
Evaluation | ∞ is the upper-bound OOV reduction for our expansion model: for each word in the development set, we ask if our model, without any vocabulary size restriction at all, could generate it.
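Computing that upper bound reduces to a coverage check over the development vocabulary; a sketch, where `can_generate` is a hypothetical membership test for the unrestricted expansion model:

```python
# Hypothetical upper-bound OOV-reduction check: what fraction of dev
# words can the unrestricted model generate at all?
def oov_reduction_upper_bound(dev_words, can_generate):
    covered = sum(1 for w in dev_words if can_generate(w))
    return covered / len(dev_words)
```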
Evaluation | Table 5: Results from running a handcrafted Turkish morphological analyzer (Oflazer, 1996) on different expansions and on the development set . |
A Generic Phrase Training Procedure | In the final step 4 (line 15), the parameters {λm, τ} are discriminatively trained on a development set using the downhill simplex method (Nelder and Mead, 1965).
Discussions | We use feature functions to decide the order and the threshold τ to locate the boundary, guided by a development set.
Introduction | A significant deviation from most other approaches is that the framework is parameterized and can be optimized jointly with the decoder to maximize translation performance on a development set . |
Results and Discussion | In feature ablation testing on the development set, the whole feature set was found to significantly outperform all other feature subsets (p < 2.2 × 10^-16).
Results and Discussion | The development set (Section 00) was used to tune the β parameter to obtain reasonable hypertag ambiguity levels; the model was not otherwise tuned to it.
Results and Discussion | limit; on the development set, this improvement eliminates more than the number of known search errors (cf.
Experimental Setup | For parameter estimation, we tune all parameters (utterance selection and path ranking) exhaustively with 0.1 intervals using our development set.
Phrasal Query Abstraction Framework | The parameters α and β are tuned on a development set and sum to 1.
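Because the two weights sum to one, tuning reduces to a one-dimensional sweep over α with β = 1 − α; a sketch, with `dev_score` a hypothetical hook and the step size our choice:

```python
# Hypothetical sweep over alpha (beta = 1 - alpha) on the dev set.
def tune_alpha(dev_score, steps=20):
    best_a, best_s = 0.0, float("-inf")
    for i in range(steps + 1):
        a = i / steps                       # alpha in [0, 1]
        s = dev_score(alpha=a, beta=1.0 - a)
        if s > best_s:
            best_a, best_s = a, s
    return best_a
```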
Phrasal Query Abstraction Framework | We estimate the percentage of the retrieved utterances based on the development set . |
Language Model Evaluation: Lexical Simplification | The data set contains a development set of 300 examples and a test set of 1710 examples. For our experiments, we evaluated the models on the test set.
Language Model Evaluation: Lexical Simplification | The best lambda was chosen based on a linear search optimized on the SemEval 2012 development set . |
Why Does Unsimplified Data Help? | For the simplification task, the optimal lambda value determined on the development set was 0.98, with a very strong bias towards the simple model. |
Data | Running on the full data set is time-consuming, so development was done on a subset of about 80,000 articles (19.9 million tokens) as a training set and 500 articles as a development set . |
Experiments | For both Wikipedia and Twitter, preliminary experiments on the development set were run to plot the prediction error for each method for each level of resolution, and the optimal resolution for each method was chosen for obtaining test results. |
Experiments | We recomputed the distributions using several values for both parameters and evaluated on the development set . |
Experiments | We evaluate our approach by comparing translation quality, as evaluated by the IBM-BLEU (Papineni et al., 2002) metric, on the NIST Chinese-to-English translation task, using MT04 as the development set to train the model parameters λ, and MT05, MT06 and MT08 as test sets.
Experiments | We therefore choose N merely based on development set performance. |
Experiments | Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT's built-in algorithms for overcoming local optima, such as random restarts and zeroing-out.
Experiments | We used DUC-03 as our development set , and tested on DUC-04 data. |
Experiments | DUC-03 was used as development set . |
Experiments | Figure 1 illustrates how ROUGE-1 scores change when α and K vary on the development set (DUC-03).
Experiments | We measured the evolving accuracy of the models on the development set (Figure 4). |
Oracle Parsing | The reason that using the gold-standard supertags doesn’t result in 100% oracle parsing accuracy is that some of the development set parses cannot be constructed by the learned grammar. |
Oracle Parsing | variety of fixed beam settings (Figure 1), considering only the subset of our development set which could be parsed with all beam settings.
Constituent Recombination | The parameters λ and p are tuned by Powell's method (Powell, 1964) on a development set, using the F1 score of PARSEVAL (Black et al., 1991) as the objective.
Experiment | For parser combination, we follow the setting of Fossum and Knight (2009), using Section 24 instead of Section 22 of WSJ treebank as development set . |
Experiment | It is tuned on a development set using the gold sec- |
Experiments | Table 1: Intrinsic evaluation accuracy [%] (development set) for Arabic segmentation and tagging.
Experiments | Table 1 shows development set accuracy for two settings.
Experiments | We tuned the feature weights on a development set using lattice-based minimum error rate training (MERT) (Macherey et al., |
Experiment | We estimated the optimal values of the stopping probabilities s by using the development set . |
Experiment | In all our experiments, we conducted ten independent runs to train our model, and selected the one that performed best on the development set in terms of parsing accuracy. |
Experiment | development set (≤ 100).
Substructure Spaces for BTKs | The coefficients αi for the composite kernel are tuned with respect to F-measure (F) on the development set of the HIT corpus.
Substructure Spaces for BTKs | Those thresholds are also tuned on the development set of the HIT corpus with respect to F-measure.
Substructure Spaces for BTKs | We use the sentences with fewer than 50 characters from the NIST MT-2002 test set as the development set (to speed up tuning for the syntax-based system) and the NIST MT-2005 test set as our test set.
Experiments | We used as a development set ten additional documents from the Old Bailey proceedings and five additional documents from Trove that were not part of our test set. |
Results and Analysis | This slightly improves performance on our development set and can be thought of as placing a prior on the glyph shape parameters. |
Results and Analysis | We performed error analysis on our development set by randomly choosing 100 word errors from the WER alignment and manually annotating them with relevant features. |
Experiments | We used minimum error rate training (Och, 2003) to tune the feature weights to maximise the BLEU score on the development set . |
Experiments | For the development set we use both ASR devset 1 and 2 from IWSLT 2005, and
Experiments | For the development set we use the NIST 2002 test set, and evaluate performance on the test sets from NIST 2003 |
Experiment | The development sets are mainly used to tune the value of the weight factor α in Equation 5.
Experiment | We evaluated the performance (F-score) of our model on the three development sets using different α values, where α is progressively increased in steps of 0.1 (0 < α < 1.0).
Experiment | The "baseline" uses a different training configuration, so the α values in decoding also need to be tuned on the development sets.
Experimentation | In addition, we reserve 33 documents in the training set as the development set and use the ground-truth entities, times and values for our training and testing.
Experimentation | Our statistics on the development set show that almost 65% of the event mentions are involved in Coreference, Parallel and Sequence relations, which account for 63%, 50% and 9% respectively.
Inferring Inter-Sentence Arguments on Relevant Event Mentions | development set; tri and tri' are the triggers of the k-th and k'-th event mentions, whose event types are et and et', in S&lt;i,j&gt; and S&lt;i',j'&gt; respectively.
Experiments | We used the development set for initial development and tuning hyperparameters. |
Experiments | For the GUSP system, we set the hyperparameters from initial experiments on the development set, and used them in all subsequent experiments.
Grounded Unsupervised Semantic Parsing | In preliminary experiments on the development set , we found that the naive model (with multinomials as conditional probabilities) did not perform well in EM. |
Experimental Assessment | We use sections 2-21 for training, 22 as development set , and 23 as test set. |
Experimental Assessment | We train all parsers for up to 30 iterations, and for each parser we select the weight vector w from the iteration with the best accuracy on the development set.
Experimental Assessment | We have computed the average value of γ on our English data set, resulting in 2.98 (variance 2.15) for the training set, and 2.95 (variance 1.96) for the development set.
Experiments | The 18 conversations annotated by all 3 annotators are used as the test set, and the remaining 70 conversations are used as the development set to tune the parameters (determining the best combination weights).
Experiments | On the development set, we used grid search to obtain the best combination weights for the two summarization methods.
Experiments | In the sentence-ranking method, the best parameters found on the development set are λsim = 0, λrel = 0.3, λsent = 0.3, λlen = 0.4.
RSP: A Random Walk Model for SP | We split the test set equally into two parts: one as the development set and the other as the final test set. |
RSP: A Random Walk Model for SP | Parameters Tuning: The parameters are tuned on the PTB development set , using AFP as the generalization data. |
RSP: A Random Walk Model for SP | This experiment is conducted on the PTB development set with RND confounders. |
Experiment | We use the first 1,419 queries together with their annotated documents as the development set to tune paraphrasing parameters (as we discussed in Section 2.3), and use the rest as the test set. |
Experiment | The ranking model is trained based on the development set . |
Paraphrasing for Web Search | {Q_i, D_i^{Label}}_{i=1}^N is a human-labeled development set.
Experiments | We randomly selected a development set and a test set, and the remaining sentence pairs were used as the training set.
Experiments | Furthermore, the development set and test set are divided into various intervals according to their best fuzzy match scores.
Experiments | All the feature weights and the weight for each probability factor (3 factors for Model-III) are tuned on the development set with minimum-error-rate training (MERT) (Och, 2003). |
Experiments | We set aside 132 documents as a development set and use 350 documents as the evaluation set. |
Experiments | We used L2-regularization; the regularization parameter was tuned using the development set.
Experiments | The parameter λ was tuned using the development set.
Experiments | The number of iterations was determined by experiments on the development set . |
Experiments | Tuning the parameter settings on the development set , we found that parameterized categories, binarization, and including punctuation gave the best F1 performance. |
Experiments | While experimenting with the development set of TuBa-D/Z, we noticed that the parser sometimes returns parses, in which paired punctuation (e.g. |