Conclusions | We address overfitting issues by cross-validating while climbing the likelihood of the training data, and propose solutions to increase the efficiency and accuracy of decoding. |
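To make the criterion concrete, one way to write a leave-one-part-out likelihood is sketched below; the notation is ours and only illustrates the idea, as the paper's exact estimator may differ in its details. The training corpus $D$ is split into $K$ parts $D_1,\dots,D_K$, and each part is scored with parameters estimated from the remaining parts:

$$\mathcal{L}_{\mathrm{CV}} = \sum_{k=1}^{K} \sum_{(e,f) \in D_k} \log p_{\theta(D \setminus D_k)}(e,f)$$

Under such an objective, a rule observed only within $D_k$ receives no probability from $\theta(D \setminus D_k)$, so memorising a single part of the data cannot raise the score.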
Introduction | Estimating such grammars under a Maximum Likelihood criterion is known to be plagued by strong overfitting, leading to degenerate estimates (DeNero et al., 2006). |
Introduction | In contrast, our learning objective not only avoids overfitting the training data but, most importantly, learns joint stochastic synchronous grammars that directly aim at generalising to unseen instances. |
Learning Translation Structure | On the other hand, estimating the parameters under Maximum-Likelihood Estimation (MLE) for the latent translation structure model is bound to overfit by memorising whole sentence-pairs, as discussed in (Mylonakis and Sima’an, 2010), with the resulting grammar estimate unable to generalise beyond the training data. |
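To see why memorisation maximises likelihood, consider the standard MLE objective over the training pairs, again in our own illustrative notation:

$$\hat{\theta} = \arg\max_{\theta} \prod_{(e,f) \in D} p_\theta(e,f)$$

If the grammar contains, for every training pair, a rule rewriting the start symbol directly to the whole pair, $S \to \langle e, f \rangle$, then concentrating the probability mass on these rules maximises the product: each training pair is derived in a single step with probability equal to its relative frequency, while unseen pairs receive probability zero.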
Learning Translation Structure | However, apart from overfitting towards long phrase-pairs, a grammar with millions of structural rules is also liable to overfit towards degenerate latent structures which, while fitting the training data well, have limited applicability to unseen sentences. |
Learning Translation Structure | The CV criterion, apart from avoiding overfitting, results in discarding the structural rules which are only found in a single part of the training corpus, leading to a more compact grammar while still retaining millions of structural rules that are more likely to generalise. |
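The filtering effect of the CV criterion can be sketched in a few lines of code. The sketch below is a minimal illustration under our own assumptions: the `extract_rules` function and the hard keep/discard decision are hypothetical simplifications of what is, in the paper, a soft estimation procedure.

```python
from collections import defaultdict

def cv_filter_rules(parts, extract_rules):
    """Keep only rules that receive cross-validated support.

    `parts` is a list of corpus parts; `extract_rules` maps a part
    to the set of rules extractable from it. When a part is held
    out, only rules licensed by the remaining parts may explain it,
    so a rule occurring in a single part is never supported.
    """
    support = defaultdict(int)
    for k, held_out in enumerate(parts):
        # Rules available for explaining the held-out part k:
        # everything extractable from the other parts.
        licensed = set()
        for j, part in enumerate(parts):
            if j != k:
                licensed |= extract_rules(part)
        # A rule counts only if it occurs in the held-out part
        # AND is licensed by the remaining parts.
        for rule in extract_rules(held_out):
            if rule in licensed:
                support[rule] += 1
    return {rule for rule, count in support.items() if count > 0}
```

A rule found in only one part is absent from `licensed` exactly when its own part is held out, so it ends up with zero support and is discarded, which is the compaction effect described above.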
Related Work | We show that a translation system based on such a joint model can perform competitively with conditional probability models when it is augmented with a rich latent hierarchical structure trained so as to avoid overfitting. |
Related Work | Cohn and Blunsom (2009) sample rules of the form proposed in (Galley et al., 2004) from a Bayesian model, employing Dirichlet Process priors favouring smaller rules to avoid overfitting. |