Experiments of Grammar Formalism Conversion | Table 4: Results of the generative parser on the development set, when trained with various weightings of the CTB training set and CDTPS.
Experiments of Parsing | We used a standard split of CTB for performance evaluation: articles 1-270 and 400-1151 as the training set, articles 301-325 as the development set, and articles 271-300 as the test set.
Experiments of Parsing | We tried the corpus weighting method when combining CDTPS with the CTB training set (abbreviated as CTB for simplicity) as training data, gradually increasing the weight of CTB (among 1, 2, 5, 10, 20, and 50) to optimize parsing performance on the development set.
Experiments of Parsing | Table 4 presents the results of the generative parser with various weights of CTB on the development set.
Our Two-Step Solution | The number of removed trees will be determined by cross validation on the development set.
Our Two-Step Solution | The value of λ will be tuned by cross validation on the development set.
Our Two-Step Solution | Corpus weighting, with the weight tuned on the development set, is exactly such an approach; it is the one we use for parsing on homogeneous treebanks in this paper.
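The corpus-weighting scheme above can be sketched as a search over duplication factors for the in-domain treebank. This is a minimal illustrative sketch, not the papers' actual code: `eval_fn` (standing in for training a parser and scoring F1 on the development set) and the toy data are assumptions; the weight grid follows the values quoted above.

```python
def weighted_training_data(in_domain, converted, w):
    """Replicate the in-domain treebank (e.g. CTB) w times, then append
    the converted treebank (e.g. CDTPS)."""
    return in_domain * w + converted

def select_corpus_weight(in_domain, converted, dev, eval_fn,
                         weights=(1, 2, 5, 10, 20, 50)):
    """Try each candidate weight and keep the one whose combined corpus
    gets the best score from eval_fn on the development set."""
    return max(
        weights,
        key=lambda w: eval_fn(weighted_training_data(in_domain, converted, w), dev),
    )
```

In practice `eval_fn` would train the parser on the weighted corpus and return its F1 on the development set; here any scoring callable can be plugged in.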
Discussion | In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set.
Experiments | Our development set (dev) consists of the NIST 2005 eval set; we use this set for optimizing MBR parameters. |
Experiments | MERT is then performed to optimize the BLEU score on a development set; for MERT, we use 40 random initial parameters as well as parameters computed using corpus-based statistics (Tromble et al., 2008).
Experiments | We select the MBR scaling factor (Tromble et al., 2008) based on the development set; it is set to 0.1, 0.01, 0.5, 0.2, 0.5, and 1.0 for the aren-phrase, aren-hier, aren-samt, zhen-phrase, zhen-hier, and zhen-samt systems, respectively.
Introduction | We employ MERT to select these weights by optimizing BLEU score on a development set.
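Selecting linear feature weights by optimizing a metric on a development set can be sketched as reranking n-best lists under candidate weight vectors. This is a deliberately simplified stand-in for MERT: it uses random restarts instead of Och's line search, and exact-match accuracy instead of BLEU, so that the sketch stays self-contained; all names (`rerank`, `tune_weights`, the dict keys) are hypothetical.

```python
import random

def rerank(weights, nbest):
    """Return the candidate with the highest linear-model score."""
    return max(nbest, key=lambda c: sum(w * f for w, f in zip(weights, c["features"])))

def tune_weights(dev_nbests, refs, dim, restarts=40, seed=0):
    """Search over random weight vectors (40 restarts, echoing the
    '40 random initial parameters' above) and keep the vector whose
    reranked dev-set output scores best under the metric."""
    rng = random.Random(seed)

    def dev_score(weights):
        hyps = [rerank(weights, nb)["text"] for nb in dev_nbests]
        # exact-match accuracy stands in for BLEU in this sketch
        return sum(h == r for h, r in zip(hyps, refs)) / len(refs)

    starts = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(restarts)]
    return max(starts, key=dev_score)
```

A real MERT implementation would additionally trace the piecewise-linear error surface along search directions; the dev-set-driven selection loop is the part shown here.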
Experimental Setup | In our experiments, the development set contains 200 sentences and the test set contains 500 sentences, both of which are randomly selected from the human translations of the 2008 NIST Open Machine Translation Evaluation: Chinese-to-English Task.
Statistical Paraphrase Generation | = c_dev(+r) / c_dev(r), where c_dev(r) is the total number of unit replacements in the generated paraphrases on the development set.
Statistical Paraphrase Generation | Replacement rate (rr): rr measures the paraphrase degree on the development set, i.e., the percentage of words that are paraphrased.
Statistical Paraphrase Generation | We define rr as: rr = w_dev(r) / w_dev(s), where w_dev(r) is the total number of words in the replaced units on the development set, and w_dev(s) is the number of words of all sentences on the development set.
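Both development-set statistics above are plain ratios over counts, so they can be computed directly. The excerpt elides the name of the first metric (`c_dev(+r) / c_dev(r)`); `replacement_precision` is used here as a hypothetical label for it, and all function names are illustrative.

```python
def replacement_precision(correct_replacements, total_replacements):
    """c_dev(+r) / c_dev(r): the fraction of unit replacements on the
    development set that are judged correct (hypothetical name)."""
    return correct_replacements / total_replacements

def replacement_rate(words_in_replaced_units, total_words):
    """rr = w_dev(r) / w_dev(s): the fraction of development-set words
    that fall inside replaced (paraphrased) units."""
    return words_in_replaced_units / total_words
```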
Experiments | To empirically investigate the parameter λ and the convergence of our algorithm aPLSA, we generated five more data sets as the development sets.
Experiments | The detailed descriptions of these five development sets, namely tune1 to tune5, are listed in Table 1 as well.
Experiments | We have done some experiments on the development sets to investigate how different values of λ affect the performance of aPLSA.
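Tuning λ against several development sets, as described above, amounts to sweeping a grid of candidate values and keeping the one with the best average score. A minimal sketch, assuming a hypothetical `eval_fn` that trains/evaluates the model for a given λ on one tuning set:

```python
def tune_lambda(lambdas, tune_sets, eval_fn):
    """Return the λ with the best score averaged over the tuning sets."""
    def avg_score(lam):
        return sum(eval_fn(lam, t) for t in tune_sets) / len(tune_sets)
    return max(lambdas, key=avg_score)
```

The same loop works for any scalar hyperparameter; only the grid and the evaluation callable change.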
Experiments | Unlike the other two corpora, the data-splitting convention of People's Daily does not reserve a development set, so in the following experiments we simply choose the model after 7 iterations when training on this corpus.
Experiments | Table 4: Error analysis for Joint S&T on the development set of CTB.
Experiments | To obtain further information about what kinds of errors are alleviated by annotation adaptation, we conduct an initial error analysis for Joint S&T on the development set of CTB.
Experiments | Note that the development set was only used for evaluating the trained model to obtain the optimal values of tunable parameters. |
Experiments | For the baseline policy, we varied r in the range [1, 5] and found that setting r = 3 yielded the best performance on the development set for both the small and large training corpus experiments.
Experiments | Optimal balances were selected using the development set.
Policies for correct path selection | In our experiments, the optimal threshold value r is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
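The threshold selection described here (varying an integer r over a small range and keeping the value with the best development-set score) can be sketched in a few lines; `eval_fn`, standing in for joint segmentation and POS-tagging evaluation on the development set, is an assumed placeholder.

```python
def tune_threshold(eval_fn, low=1, high=5):
    """Exhaustively try each integer threshold r in [low, high] and
    return (best_r, best_dev_score)."""
    best_r = max(range(low, high + 1), key=eval_fn)
    return best_r, eval_fn(best_r)
```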
Experiments | For Chinese-English-Spanish translation, we used the development set (devset3) released for the pivot task as the test set, which contains 506 source sentences, with 7 reference translations in English and Spanish. |
Experiments | To tune parameters of our systems, we created a development set of 1,000 sentences taken from the training sets, with 3 reference translations in both English and Spanish.
Experiments | This development set is also used to train the regression learning model. |
Argument Mapping Model | Given these features with gold standard parses, our argument mapping model can predict entire argument mappings with an accuracy rate of 87.96% on the test set, and 87.70% on the development set.
Identification and Labeling Models | All classifiers were trained for 500 iterations of L-BFGS training, a quasi-Newton method from the numerical optimization literature (Liu and Nocedal, 1989), using Zhang Le's maxent toolkit. To prevent overfitting we used Gaussian priors with global variances of 1 and 5 for the identifier and labeler, respectively. The Gaussian priors were determined empirically by testing on the development set.
Identification and Labeling Models | The size of the window was determined experimentally on the development set; we use the same window sizes throughout.
Experiments | The number of iterations was determined by experiments on the development set . |
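Choosing the number of training iterations on the development set, as in the line above, amounts to tracking which iteration scores best and keeping that model. A hedged sketch with placeholder callables (`train_step` performing one pass over the training data, `eval_on_dev` scoring the current model):

```python
def best_iteration(train_step, eval_on_dev, max_iters):
    """Run iterative training and return the iteration (and score) that
    performed best on the development set."""
    best_iter, best_score, state = 0, float("-inf"), None
    for i in range(1, max_iters + 1):
        state = train_step(state)      # one training pass
        score = eval_on_dev(state)
        if score > best_score:
            best_iter, best_score = i, score
    return best_iter, best_score
```

The same pattern covers the earlier remark about picking "the model after 7 iterations" when no development set is reserved: with a development set, the stopping point is selected rather than fixed.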
Experiments | Tuning the parameter settings on the development set, we found that parameterized categories, binarization, and including punctuation gave the best F1 performance.
Experiments | While experimenting with the development set of TüBa-D/Z, we noticed that the parser sometimes returns parses in which paired punctuation (e.g.
Co-training strategy for prosodic event detection | Data split: Development set: 20, 1,356, 2,275; Labeled set L: 5, 347, 573; Unlabeled set U: 1,027, 77,207, 129,305; speakers: f2b, f3b, m2b, m3b, m4b.
Conclusions | In our experiments, we used some labeled data as a development set to estimate parameters.
Experiments and results | Among the labeled data, 102 utterances from the f1a and m1b speakers are used for testing; 20 utterances randomly chosen from f2b, f3b, m2b, m3b, and m4b are used as the development set to optimize parameters such as λ and the confidence-level threshold; 5 utterances are used as the initial training set L; and the rest of the data is used as the unlabeled set U, which has 1,027 unlabeled utterances (we removed the human labels for the co-training experiments).
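The co-training data split described above (a speaker-held-out test set, then a random development set, a small labeled seed set L, and the remainder as unlabeled U) can be sketched as follows. The speaker IDs and set sizes come from the excerpt; the function name, dict keys, and fixed seed are illustrative assumptions.

```python
import random

def split_corpus(utterances, test_speakers=("f1a", "m1b"),
                 dev_size=20, labeled_size=5, seed=0):
    """Hold out test speakers, then randomly carve the remaining
    utterances into dev / seed-labeled L / unlabeled U."""
    test = [u for u in utterances if u["speaker"] in test_speakers]
    rest = [u for u in utterances if u["speaker"] not in test_speakers]
    rng = random.Random(seed)
    rng.shuffle(rest)
    dev = rest[:dev_size]
    labeled = rest[dev_size:dev_size + labeled_size]
    unlabeled = rest[dev_size + labeled_size:]   # labels would be discarded here
    return test, dev, labeled, unlabeled
```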