Beam-Width Prediction | Figure 3 is a visual representation of beam-width prediction on a single sentence of the development set using the Berkeley latent-variable grammar and Boundary FOM.
Conclusion and Future Work | Table 1: Section 22 development set results for CYK and Beam-Search (Beam) parsing using the Berkeley latent-variable grammar. |
Open/Closed Cell Classification | We found the value λ = 10² to give the best performance on our development set, and we use this value in all of our experiments.
Open/Closed Cell Classification | Figures 2a and 2b compare the pruned charts of Chart Constraints and Constituent Closure for a single sentence in the development set.
Open/Closed Cell Classification | We compare the effectiveness of Constituent Closure, Complete Closure, and Chart Constraints by decreasing the percentage of chart cells closed until accuracy over all sentences in our development set starts to decline.
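The tuning loop in the sentence above can be sketched as a simple sweep. A minimal Python sketch follows; the evaluate callback and the 5-point step size are our assumptions, not details from the paper:

    def tune_closure_percentage(evaluate, step=5):
        """Sweep the percentage of closed chart cells and keep the most
        aggressive setting whose development-set accuracy has not yet
        started to decline.

        `evaluate(closed_pct)` is an assumed callback: it parses the dev
        set with that percentage of cells closed and returns accuracy.
        """
        best_pct, best_acc = 0, evaluate(0)
        for pct in range(step, 100, step):
            acc = evaluate(pct)
            if acc < best_acc:          # accuracy starts to decline: stop
                break
            best_pct, best_acc = pct, acc
        return best_pct

    # Toy usage with a synthetic accuracy curve peaking at 40% of cells closed:
    print(tune_closure_percentage(lambda pct: 0.90 - ((pct - 40) / 100) ** 2))  # -> 40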
Results | In Table 1 we present the accuracy and parse time for three baseline parsers on the development set: exhaustive CYK parsing, beam-search parsing using only the inside score (β), and beam-search parsing using the Boundary FOM.
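To make the contrast between the two beam-search baselines concrete, here is a minimal sketch of how a chart cell might be pruned under either ranking; the tuple layout and the prune_cell helper are illustrative, not the parser's actual API:

    import heapq

    def prune_cell(entries, beam_width, fom):
        """Keep only the `beam_width` best entries of a chart cell, ranked
        by the figure-of-merit `fom`: the inside score alone, or the inside
        score combined with an outside estimate such as the Boundary FOM."""
        return heapq.nlargest(beam_width, entries, key=fom)

    # Toy usage: entries are (label, inside_log_prob, outside_estimate) tuples.
    cell = [("NP", -2.0, -1.0), ("VP", -5.0, -0.5), ("S", -3.0, -4.0)]
    print(prune_cell(cell, 2, fom=lambda e: e[1]))          # inside score only
    print(prune_cell(cell, 2, fom=lambda e: e[1] + e[2]))   # inside + outside estimate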
Cluster Feature Selection | Our main idea is to learn the best set of prefix lengths, for instance by validating their effectiveness on a development set.
Cluster Feature Selection | Because this method does not need validation on the development set, it is the laziest but also the fastest method for selecting clusters.
Cluster Feature Selection | Exhaustive Search (ES): ES tries every possible combination of the set I and picks the one that works best on the development set.
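As a concrete illustration, a brute-force sketch of ES in Python; the dev_score callback, which trains with a candidate subset of prefix lengths and returns development-set accuracy, is a hypothetical stand-in. Note that the cost grows as 2^|I| − 1 evaluations, which is why the lazier methods above can be attractive:

    from itertools import chain, combinations

    def exhaustive_search(I, dev_score):
        """Try every non-empty subset of the prefix-length set I and return
        the one that scores best on the development set."""
        subsets = chain.from_iterable(
            combinations(I, r) for r in range(1, len(I) + 1))
        return max(subsets, key=dev_score)

    # Toy usage: a synthetic scorer that prefers prefix lengths 4 and 6.
    best = exhaustive_search(
        [2, 4, 6, 8],
        dev_score=lambda s: (4 in s) + (6 in s) - 0.01 * len(s))
    print(best)  # -> (4, 6)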
Experiments | The set of prefix lengths that worked best on the development set was chosen to select clusters.
Experiments | Interestingly, ES did not always outperform the two statistical methods, which may be due to overfitting to the development set.
Experiments | For the semi-supervised system, each test fold was the same one used in the baseline, and the other 4 folds were further split into a training set and a development set at a ratio of 7:3 for selecting clusters.
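A minimal sketch of that fold handling, with the 7:3 ratio taken from the text; the shuffling and the seed are our assumptions:

    import random

    def split_train_dev(folds, test_idx, dev_ratio=0.3, seed=0):
        """Hold out fold `test_idx`; pool the remaining folds and split
        them into training and development sets at a 7:3 ratio."""
        pool = [ex for i, fold in enumerate(folds) if i != test_idx
                for ex in fold]
        random.Random(seed).shuffle(pool)
        n_dev = int(len(pool) * dev_ratio)
        return pool[n_dev:], pool[:n_dev]   # (train, dev)

    # Toy usage with 5 folds of 10 examples each:
    folds = [list(range(i * 10, i * 10 + 10)) for i in range(5)]
    train, dev = split_train_dev(folds, test_idx=0)
    print(len(train), len(dev))  # -> 28 12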
Experiments | We measured the evolving accuracy of the models on the development set (Figure 4). |
Oracle Parsing | The reason that using the gold-standard supertags doesn’t result in 100% oracle parsing accuracy is that some of the development set parses cannot be constructed by the learned grammar. |
Oracle Parsing | …a variety of fixed beam settings (Figure 1), considering only the subset of our development set which could be parsed with all beam settings.
Experiments | We used DUC-03 as our development set, and tested on DUC-04 data.
Experiments | DUC-03 was used as the development set.
Experiments | Figure 1 illustrates how ROUGE-1 scores change when α and K vary on the development set (DUC-03).
Experiments | The 18 conversations annotated by all 3 annotators are used as the test set, and the remaining 70 conversations are used as the development set to tune the parameters (determining the best combination weights).
Experiments | On the development set, we used grid search to obtain the best combination weights for the two summarization methods.
Experiments | In the sentence-ranking method, the best parameters found on the development set are λsim = 0, λrel = 0.3, λcent = 0.3, λlen = 0.4.
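The grid search over combination weights described above can be sketched as follows; the 0.1 grid step and the toy objective are our assumptions, since the granularity is not stated:

    from itertools import product

    def grid_search_weights(n_weights, objective, step=0.1):
        """Enumerate weight vectors on a grid whose components sum to 1
        and return the vector maximizing the development-set objective."""
        ticks = [round(i * step, 10) for i in range(int(1 / step) + 1)]
        best, best_score = None, float("-inf")
        for w in product(ticks, repeat=n_weights):
            if abs(sum(w) - 1.0) > 1e-9:   # keep only convex combinations
                continue
            score = objective(w)
            if score > best_score:
                best, best_score = w, score
        return best

    # Toy objective peaking at (0, 0.3, 0.3, 0.4), matching the weights above:
    target = (0.0, 0.3, 0.3, 0.4)
    best = grid_search_weights(
        4, lambda w: -sum((a - b) ** 2 for a, b in zip(w, target)))
    print(best)  # -> (0.0, 0.3, 0.3, 0.4)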
Data | Running on the full data set is time-consuming, so development was done on a subset of about 80,000 articles (19.9 million tokens) as a training set and 500 articles as a development set.
Experiments | For both Wikipedia and Twitter, preliminary experiments on the development set were run to plot the prediction error for each method for each level of resolution, and the optimal resolution for each method was chosen for obtaining test results. |
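That per-method selection step might look like the following sketch; prediction_error is a hypothetical stand-in for whatever error metric the experiments used:

    def pick_resolution(methods, resolutions, prediction_error):
        """For each method, sweep the candidate resolutions on the
        development set and keep the one with the lowest prediction error."""
        return {m: min(resolutions, key=lambda r: prediction_error(m, r))
                for m in methods}

    # Toy usage: each method prefers a different (hypothetical) resolution.
    err = {("kde", 1): 0.9, ("kde", 5): 0.4, ("kde", 10): 0.7,
           ("grid", 1): 0.5, ("grid", 5): 0.6, ("grid", 10): 0.8}
    print(pick_resolution(["kde", "grid"], [1, 5, 10],
                          lambda m, r: err[(m, r)]))
    # -> {'kde': 5, 'grid': 1}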
Experiments | We recomputed the distributions using several values for both parameters and evaluated on the development set . |
Experiments | We evaluate our approach by comparing translation quality, as measured by the IBM-BLEU metric (Papineni et al., 2002), on the NIST Chinese-to-English translation task, using MT04 as the development set to train the model parameters λ, and MT05, MT06, and MT08 as test sets.
Experiments | We therefore choose N based solely on development set performance.
Experiments | Unfortunately, variance in development set BLEU scores tends to be higher than in test set scores, despite SAMT MERT's built-in mechanisms for overcoming local optima, such as random restarts and zeroing-out.