Conclusion | We have extended the IBM models and the HMM model by adding an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting and, in particular, the “garbage collection” effect.
Method | Maximum likelihood training is prone to overfitting, especially in models with many parameters.
Method | In word alignment, one well-known manifestation of overfitting is that rare words can act as “garbage collectors,” attracting alignments from words they do not actually translate.
Method | We have previously proposed another simple remedy to overfitting in the context of unsupervised part-of-speech tagging (Vaswani et al., 2010), which is to minimize the size of the model using a smoothed ℓ0 prior.
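Method | For reference, a smoothed ℓ0 prior replaces the count of nonzero parameters with a differentiable surrogate; the particular surrogate and hyperparameters below are illustrative assumptions, not necessarily the exact form of Vaswani et al. (2010):

\[
\|\theta\|_0 \;=\; \sum_i \mathbf{1}\!\left[\theta_i \neq 0\right]
\;\approx\; \sum_i \left(1 - e^{-\theta_i^2/(2\sigma^2)}\right),
\qquad
P(\theta) \;\propto\; e^{-\alpha \|\theta\|_0},
\]

where the approximation approaches the exact ℓ0 norm as σ → 0, and the prior strength α controls how aggressively parameters are pushed to zero, compacting the translation table.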
Introduction | Another possible reason why large training data have not yet shown the expected improvements in discriminative SMT is an overfitting problem specific to currently popular online learning techniques.
Introduction | Selecting features jointly across shards and averaging counters the overfitting effect that is inherent to stochastic updating.
Joint Feature Selection in Distributed Stochastic Learning | Our Algorithm 4 (IterSelSGD) introduces feature selection into distributed learning, both for increased efficiency and as a more radical measure against overfitting.
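Joint Feature Selection in Distributed Stochastic Learning | A minimal sketch of the idea, not the authors' exact IterSelSGD: each epoch, shards run SGD locally from shared weights, the shard weights are averaged, and only the k strongest features survive. The shard layout, the hinge loss, and scoring features by averaged-weight magnitude are all assumptions for illustration.

```python
import numpy as np

def iter_sel_sgd(shards, n_features, k, epochs, lr=0.1):
    """Sketch of distributed SGD with joint feature selection.

    shards: list of datasets, each a list of (x, y) pairs with
            x an np.ndarray feature vector and y in {-1, +1}.
    k:      number of features kept after every epoch.
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        # Each shard runs one local SGD pass from the shared weights.
        local = []
        for shard in shards:
            w_s = w.copy()
            for x, y in shard:
                if y * np.dot(w_s, x) < 1:        # hinge-loss subgradient step
                    w_s += lr * y * x
            local.append(w_s)
        # Average the shard weights, then jointly select the k features
        # with the largest averaged magnitude and zero out the rest --
        # the selection step acting as a measure against overfitting.
        w = np.mean(local, axis=0)
        keep = np.argsort(-np.abs(w))[:k]
        mask = np.zeros_like(w)
        mask[keep] = 1.0
        w *= mask
    return w
```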
System Architecture | The second term is a regularizer for reducing overfitting.
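System Architecture | Written out generically (the L2 form is an assumption for concreteness; the section only states the term's role), the objective combines a data-fit term with the regularizer:

\[
\mathcal{L}(w) \;=\; \sum_i \ell\big(f(x_i; w),\, y_i\big) \;+\; \lambda\,\|w\|_2^2,
\]

where the hyperparameter λ controls how strongly the weights are shrunk, trading data fit against overfitting.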
System Architecture | To avoid overfitting, we collect only those word unigrams and bigrams whose frequency is larger than 2 in the training set.
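System Architecture | A minimal sketch of this frequency cutoff (function and variable names are illustrative):

```python
from collections import Counter

def collect_ngram_features(sentences, min_freq=2):
    """Collect word unigrams and bigrams whose frequency in the
    training set is larger than min_freq (here, larger than 2),
    discarding rare n-grams to avoid overfitting."""
    counts = Counter()
    for tokens in sentences:                    # tokens: a list of words
        counts.update(tokens)                   # unigram counts
        counts.update(zip(tokens, tokens[1:]))  # bigram counts
    return {ngram for ngram, c in counts.items() if c > min_freq}
```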
System Architecture | To reduce overfitting, we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999) for all training methods.
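System Architecture | Concretely, a Gaussian weight prior adds a quadratic penalty to the conditional log-likelihood; the single shared variance σ² below is the standard textbook form, assumed here for illustration:

\[
\mathcal{L}(w) \;=\; \sum_i \log p(y_i \mid x_i; w) \;-\; \sum_j \frac{w_j^2}{2\sigma^2},
\]

which is equivalent to the L2 regularizer above with λ = 1/(2σ²): a smaller prior variance σ² pulls the weights harder toward zero.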