Abstract | Furthermore, a new tensor factorization approach is proposed to speed up the model and avoid overfitting. |
Conclusion | Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and avoids the risk of overfitting. |
Introduction | by the design of features, and the number of features could be so large that the resulting models are too large for practical use and prone to overfitting on the training corpus.
Introduction | Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and prevents overfitting. |
Introduction | Not only does this approach improve the efficiency of our model, but it also avoids the risk of overfitting. |
Max-Margin Tensor Neural Network | Moreover, the additional tensor could bring millions of parameters to the model, which makes the model suffer from the risk of overfitting. |
Max-Margin Tensor Neural Network | As long as the factorization rank r is small enough, the factorized tensor operation would be much faster than the un-factorized one and the number of free parameters would also be much smaller, which prevents the model from overfitting. |
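The parameter and speed savings of this kind of factorization can be illustrated with a toy bilinear score; the names P, Q, r and all values below are illustrative, not the paper's notation.

```python
# Toy sketch: a bilinear score x^T W y with a full d x d tensor slice W
# needs d*d parameters, while the factorized form W = P Q with P (d x r)
# and Q (r x d) needs only 2*d*r parameters and lets us compute the score
# as (x^T P)(Q y) in O(d*r) instead of O(d^2) time.

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def bilinear_full(x, W, y):            # x^T W y, O(d^2)
    Wy = matvec(W, y)
    return sum(x[i] * Wy[i] for i in range(len(x)))

def bilinear_factored(x, P, Q, y):     # (x^T P)(Q y), O(d*r)
    Qy = matvec(Q, y)
    xP = [sum(x[i] * P[i][k] for i in range(len(x))) for k in range(len(Qy))]
    return sum(a * b for a, b in zip(xP, Qy))

d, r = 6, 2
P = [[0.1 * (i + k + 1) for k in range(r)] for i in range(d)]
Q = [[0.2 * (k - j) for j in range(d)] for k in range(r)]
W = [[sum(P[i][k] * Q[k][j] for k in range(r)) for j in range(d)] for i in range(d)]
x = [(-1.0) ** i * (i + 1) for i in range(d)]
y = [(i - 2.0) / 3.0 for i in range(d)]

print(abs(bilinear_full(x, W, y) - bilinear_factored(x, P, Q, y)) < 1e-9)
print(d * d, "vs", 2 * d * r)          # 36 full parameters vs 24 factorized
```

The gap between d*d and 2*d*r grows quickly with d, which is where both the speedup and the reduced overfitting risk come from.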
Related Work | However, given the small size of their tensor matrix, they do not face the high time cost and overfitting problems that we face in modeling a sequence labeling task like Chinese word segmentation.
Related Work | That’s why we propose to decrease the computational cost and avoid overfitting with tensor factorization.
Related Work | By introducing tensor factorization into the neural network model for sequence labeling tasks, model training and inference are sped up and overfitting is prevented.
Conclusions | We address overfitting issues by cross-validated climbing of the training data likelihood and propose solutions to increase the efficiency and accuracy of decoding.
Introduction | Estimating such grammars under a Maximum Likelihood criterion is known to be plagued by strong overfitting leading to degenerate estimates (DeNero et al., 2006). |
Introduction | In contrast, our learning objective not only avoids overfitting the training data but, most importantly, learns joint stochastic synchronous grammars which directly aim at generalisation towards yet unseen instances. |
Learning Translation Structure | On the other hand, estimating the parameters under Maximum-Likelihood Estimation (MLE) for the latent translation structure model is bound to overfit towards memorising whole sentence-pairs as discussed in (Mylonakis and Sima’an, 2010), with the resulting grammar estimate not being able to
Learning Translation Structure | However, apart from overfitting towards long phrase-pairs, a grammar with millions of structural rules is also liable to overfit towards degenerate latent structures which, while fitting the training data well, have limited applicability to unseen sentences. |
Learning Translation Structure | The CV criterion, apart from avoiding overfitting, results in discarding the structural rules which are only found in a single part of the training corpus, leading to a more compact grammar while still retaining millions of structural rules that are more likely to generalise.
Related Work | We show that a translation system based on such a joint model can perform competitively in comparison with conditional probability models, when it is augmented with a rich latent hierarchical structure trained adequately to avoid overfitting. |
Related Work | Cohn and Blunsom (2009) sample rules of the form proposed in (Galley et al., 2004) from a Bayesian model, employing Dirichlet Process priors favouring smaller rules to avoid overfitting. |
Conclusion | On top of these hard constraints, the sparse prior of VB helps make the model less prone to overfitting to infrequent phrase pairs, and thus improves the quality of the phrase pairs the model learns. |
Experiments | Using EM, because of overfitting, AER first drops and then increases again as the number of iterations varies from 1 to 10.
Experiments | The gain is especially large on the test data set, indicating VB is less prone to overfitting.
Introduction | In this direction, Expectation Maximization at the phrase level was proposed by Marcu and Wong (2002), who, however, experienced two major difficulties: computational complexity and controlling overfitting.
Introduction | Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair. |
Introduction | We address the tendency of EM to overfit by using Bayesian methods, where sparse priors assign greater mass to parameter vectors with fewer nonzero values, therefore favoring shorter, more frequent phrases.
Variational Bayes for ITG | If we do not put any constraint on the distribution of phrases, EM overfits the data by memorizing every sentence pair. |
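The sparsifying effect of a Dirichlet prior under VB can be sketched with the standard mean-field update for a multinomial, where the MLE normalization is replaced by exponentiated digamma terms; the counts and the prior value below are invented toy values, not from the cited experiments.

```python
import math

# Mean-field VB update for a multinomial with a symmetric Dirichlet prior:
# theta_k is proportional to exp(psi(c_k + alpha)); with alpha < 1 this
# shrinks rare events far more than frequent ones, unlike plain MLE.

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x,
    # then an asymptotic series once x >= 6.
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (
        1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

counts = {"the": 50, "of": 30, "rare": 1}   # toy event counts
alpha = 0.001                               # sparse symmetric Dirichlet prior
total = sum(counts.values())

mle = {w: c / total for w, c in counts.items()}
norm = digamma(total + alpha * len(counts))
vb = {w: math.exp(digamma(c + alpha) - norm) for w, c in counts.items()}

print(vb["rare"] < mle["rare"])        # the rare event's mass is heavily shrunk
print(vb["the"] / mle["the"])          # frequent events are nearly untouched
```

Because exp(psi(x)) behaves roughly like x - 0.5 for large x but drops off sharply for small x, mass is pulled away from rarely observed events, which is exactly the anti-memorization effect discussed above.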
Conclusion | Our investigation with variational Bayes showed that the improvement is due both to finding sparse grammars (mitigating overfitting) and to searching over the space of all grammars (mitigating narrowness).
Evaluation | EM gives a strong baseline, since it already uses rules whose depth and number of frontier nodes are limited by stipulation, which helps with the overfitting we have mentioned; surprisingly, it outperforms its discriminative counterpart in both precision and recall (and consequently RelF1).
Evaluation | We conclude that the mitigation of both factors (narrowness and overfitting) contributes to the performance gain of GS.
Introduction | In summary, previous methods suffer from problems of narrowness of search, having to restrict the space of possible rules, and overfitting in preferring overly specific grammars. |
Introduction | We pursue the use of hierarchical probabilistic models incorporating sparse priors to simultaneously solve both the narrowness and overfitting problems. |
Introduction | Segmentation is achieved by introducing a prior bias towards grammars that are compact representations of the data, namely by enforcing simplicity and sparsity: preferring simple rules (smaller segments) unless the use of a complex rule is evidenced by the data (through repetition), and thus mitigating the overfitting problem. |
The STSG Model | (Eisner, 2003) However, as noted earlier, EM is subject to the narrowness and overfitting problems. |
Experiments | The approaches that depend entirely on the labeled data are likely to run into overfitting.
Experiments | Linear SVM performed better than the other two, since the large-margin constraint together with the linear model constraint can alleviate overfitting.
Introduction | When we build a naive model to detect relations, the model tends to overfit the labeled data.
Relation Extraction with Manifold Models | Integration of the unlabeled data can help solve overfitting problems when the labeled data is not sufficient. |
Relation Extraction with Manifold Models | The second term bounds the mapping function f and prevents overfitting.
Relation Extraction with Manifold Models | The algorithm exploits unlabeled data, which helps prevent overfitting.
Experiments | Without cross-training we observe a reduction in performance, due to overfitting.
Extraction with Lexicons | However, there is a danger of overfitting, which we discuss in Section 4.2.4.
Extraction with Lexicons | 4.2.4 Preventing Lexicon Overfitting |
Extraction with Lexicons | If we now train the CRF on the same examples that generated the lexicon features, then the CRF will likely overfit, and weight the lexicon features too highly!
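The cross-training remedy for this can be sketched as a leave-one-fold-out scheme: the lexicon feature for each fold is computed from a lexicon built only on the other folds, so the learner never sees a lexicon that memorizes its own training entities. All names and the toy folds below are illustrative, not from the paper.

```python
# Sketch of cross-training for lexicon features: each fold's lexicon
# feature is computed from a lexicon built on the OTHER folds only.

def build_lexicon(folds):
    lex = set()
    for fold in folds:
        for tokens, entities in fold:
            lex.update(entities)
    return lex

def lexicon_features(folds):
    feats = []
    for i, fold in enumerate(folds):
        others = folds[:i] + folds[i + 1:]
        lex = build_lexicon(others)          # held-out lexicon for fold i
        for tokens, _ in fold:
            feats.append([tok in lex for tok in tokens])
    return feats

# Toy folds of (tokens, gold entity set) pairs.
folds = [
    [(["Paris", "is", "nice"], {"Paris"})],
    [(["Berlin", "rocks"], {"Berlin"})],
    [(["visit", "Paris"], {"Paris"})],
]
feats = lexicon_features(folds)
print(feats[0])  # "Paris" fires only because it also occurs in fold 2
```

The key point is that a token fires the lexicon feature only if it appears as an entity somewhere outside its own fold, mimicking test-time conditions.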
Related Work | Crucial to LUCHS’s different setting is also the need to avoid overfitting.
Abstract | These methods tend to overfit when the available training corpus is limited, especially if the number of features is large or the number of values for a feature is large.
Conclusion | This is probably due to a reduction in overfitting.
Introduction | In an effort to reduce overfitting, they use a combination of a Gaussian prior and early stopping.
Introduction | This is due to overfitting, which is a serious problem in most NLP tasks in resource-poor languages where annotated data is scarce.
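The Gaussian-prior-plus-early-stopping combination can be sketched on a toy one-parameter logistic regression; this is a minimal illustration of the two mechanisms, not the cited system's implementation, and all data values are invented.

```python
import math

# Gradient ascent on an L2-penalized (Gaussian prior) log-likelihood,
# with early stopping when held-out likelihood stops improving.

def loglik(w, data):
    return sum(math.log(1.0 / (1.0 + math.exp(-y * w * x))) for x, y in data)

def train(train_data, dev_data, sigma2=1.0, lr=0.1, max_iters=200):
    w, best_w, best_dev = 0.0, 0.0, float("-inf")
    for _ in range(max_iters):
        # Gradient of sum_i log sigma(y_i w x_i) is sum_i y_i x_i / (1 + e^{y_i w x_i}).
        grad = sum(y * x / (1.0 + math.exp(y * w * x)) for x, y in train_data)
        grad -= w / sigma2                 # Gaussian prior N(0, sigma2)
        w += lr * grad
        dev = loglik(w, dev_data)
        if dev <= best_dev:                # early stopping on held-out data
            break
        best_dev, best_w = dev, w
    return best_w

train_data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-1.5, -1), (0.5, -1)]
dev_data = [(1.2, 1), (-0.8, -1)]
w = train(train_data, dev_data)
print(w > 0)   # learns the right sign on this toy data
```

The prior keeps the weight bounded even on noisy points like (0.5, -1), and the dev-set check halts training before the model chases them.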
Maximum Entropy Based Model for Hindi NER | From the above discussion it is clear that the system suffers from overfitting if a large number of features are used to train the system. |
Copula Models for Text Regression | On the other hand, once such assumptions are removed, another problem arises: the models might be prone to errors and suffer from the overfitting issue.
Copula Models for Text Regression | Therefore, coping with the tradeoff between expressiveness and overfitting seems rather important in statistical approaches that capture stochastic dependency.
Copula Models for Text Regression | This is of crucial importance to modeling text data: instead of using the classic bag-of-words representation with raw counts, we are now working with uniform marginal CDFs, which helps cope with the overfitting issue due to noise and data sparsity.
Discussions | The second issue is about overfitting.
Experiments | On the pre-2009 dataset, we see that the linear regression and linear SVM perform reasonably well, but the Gaussian kernel SVM performs less well, probably due to overfitting.
Experiments | Notice that there is a large performance improvement after the first step (which alone is a linear solver), but overfitting occurs after step 11. |
Experiments | This might be a result of overfitting the model to a single response variable which usually has a smooth behaviour. |
Experiments | In contrast, the multitask learning property of BGL reduces this type of overfitting by providing more statistical evidence for the terms and users, thus yielding not only better inference performance but also a more accurate model.
Methods | Although flexible, this approach would be doomed to failure due to the sheer size of the resulting feature set, and the propensity to overfit all but the largest of training sets. |
Methods | The ℓ1-norm regularisation has found many applications in several scientific fields, as it encourages sparse solutions which reduce the possibility of overfitting and enhance the interpretability of the inferred model (Hastie et al., 2009).
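Why the ℓ1 penalty yields sparse solutions can be seen from its proximal operator, soft-thresholding, which sets small coordinates exactly to zero rather than merely shrinking them; the weight vector below is an invented illustration.

```python
# Soft-thresholding, the proximal operator of the l1 penalty:
# each coordinate is shrunk toward zero by lam, and coordinates
# smaller than lam in magnitude become exactly zero.

def soft_threshold(w, lam):
    out = []
    for v in w:
        mag = abs(v) - lam
        out.append(0.0 if mag <= 0 else (mag if v > 0 else -mag))
    return out

w = [0.8, -0.05, 0.3, 0.02, -1.2]
sparse = soft_threshold(w, 0.1)
print(sparse)
print(sum(1 for v in sparse if v == 0.0))  # number of coordinates zeroed out
```

An ℓ2 penalty, by contrast, rescales every coordinate but never produces exact zeros, which is why ℓ1 is the one associated with feature selection and interpretability.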
Introduction | In speech and language processing, smoothing is essential to reduce overfitting , and Kneser-Ney (KN) smoothing (Kneser and Ney, 1995; Chen and Goodman, 1999) has consistently proven to be among the best-performing and most widely used methods. |
Word Alignment | It also contains most of the model’s parameters and is where overfitting occurs most. |
Word Alignment | However, MLE is prone to overfitting, one symptom of which is the “garbage collection” phenomenon where a rare English word is wrongly aligned to many French words.
Word Alignment | To reduce overfitting, we use expected KN smoothing during the M step.
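For reference, interpolated Kneser-Ney with absolute discounting can be sketched for bigrams as follows; this is the standard textbook formulation over observed counts, not the expected-count variant run inside EM in the cited work, and the corpus is a toy string.

```python
from collections import Counter

# Interpolated Kneser-Ney for bigrams: subtract an absolute discount D
# from each seen bigram count, and redistribute the freed mass via
# continuation probabilities (how many distinct contexts a word follows).

def kn_bigram_model(tokens, D=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])              # c(u)
    continuations = Counter(w for (_, w) in bigrams)   # N1+(., w)
    followers = Counter(u for (u, _) in bigrams)       # N1+(u, .)
    total_types = len(bigrams)                         # N1+(., .)

    def prob(u, w):
        c_u = context_counts[u]
        p_cont = continuations[w] / total_types
        discounted = max(bigrams[(u, w)] - D, 0.0) / c_u
        backoff_weight = D * followers[u] / c_u
        return discounted + backoff_weight * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = kn_bigram_model(tokens)
total = sum(p("the", w) for w in set(tokens))
print(round(total, 6))   # the conditional distribution sums to one
```

The continuation counts are what distinguish KN from plain absolute discounting: a word seen after many different contexts gets more backoff mass than one welded to a single context.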
Background | PLSA solves the polysemy problem; however, it is not considered a fully generative model of documents and it is known to be prone to overfitting (Blei et al., 2003).
Background | LDA performs better than PLSA for small datasets since it avoids overfitting and it supports polysemy (Blei et al., 2003). |
Experiments | LDA was chosen to generate the topic models of clinical reports due to its being a generative probabilistic model for documents and its robustness to overfitting.
Experiments | SVM was chosen as the classification algorithm, as it was shown to perform well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and it is robust to overfitting (Sebastiani, 2002).
Conclusion | We have extended the IBM models and the HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting and, in particular, the “garbage collection” effect.
Method | Maximum likelihood training is prone to overfitting , especially in models with many parameters. |
Method | In word alignment, one well-known manifestation of overfitting is that rare words can act as “garbage collectors” |
Method | We have previously proposed another simple remedy to overfitting in the context of unsupervised part-of-speech tagging (Vaswani et al., 2010), which is to minimize the size of the model using a smoothed ℓ0 prior.
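The smoothed ℓ0 idea can be sketched with a common differentiable surrogate for the count of nonzero parameters; this is a generic illustration under assumed notation (theta, beta), and the exact functional form used in the cited paper may differ.

```python
import math

# A differentiable surrogate for the l0 "norm" (number of nonzero
# parameters) on nonnegative parameters theta_i:
#     ||theta||_0  ~  sum_i (1 - exp(-theta_i / beta))
# As beta -> 0 the surrogate approaches the exact count of nonzeros,
# while for beta > 0 it stays smooth and can be used inside gradient-based
# training to push the model toward fewer active parameters.

def smoothed_l0(theta, beta):
    return sum(1.0 - math.exp(-t / beta) for t in theta)

theta = [0.9, 0.0, 0.4, 0.0, 0.05]   # toy parameter vector, 3 nonzeros
for beta in (0.1, 0.01, 0.001):
    print(beta, round(smoothed_l0(theta, beta), 3))
# Smaller beta makes the penalty approach the true nonzero count (3 here).
```

Minimizing such a penalty shrinks the effective model size, which is the mechanism behind compacting the translation table and suppressing garbage collection.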
A Distributional Model for Argument Classification | First, we propose a model that does not depend on complex syntactic information in order to minimize the risk of overfitting.
Abstract | The resulting argument classification model promotes a simpler feature space that limits the potential overfitting effects. |
Introduction | Notice how this is also a general problem of statistical learning processes, as large fine-grained feature sets are more exposed to the risks of overfitting.
Related Work | While these approaches increase the expressive power of the models to capture more general linguistic properties, they rely on complex feature sets, demand more training data, and increase the overall exposure to overfitting effects.
System Architecture | The second term is a regularizer for reducing overfitting.
System Architecture | To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
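A frequency cut-off of this kind can be sketched with a counter over the training sentences; the function name, threshold handling, and toy sentences below are illustrative, not from the paper.

```python
from collections import Counter

# Keep only word unigrams and bigrams seen more than twice in the
# training set; everything rarer is pruned from the feature space.

def collect_ngram_features(sentences, min_count=3):
    counts = Counter()
    for sent in sentences:
        counts.update(sent)                       # unigrams
        counts.update(zip(sent, sent[1:]))        # bigrams
    return {g for g, c in counts.items() if c >= min_count}

sentences = [
    ["good", "movie"], ["good", "movie"], ["good", "plot"],
    ["bad", "movie"], ["good", "movie"],
]
feats = collect_ngram_features(sentences)
print(("good", "movie") in feats)   # bigram seen 3 times: kept
print("bad" in feats)               # seen once: pruned
```

Pruning singleton and doubleton n-grams removes exactly the features a model could only memorize, which is why such thresholds act as a cheap overfitting control.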
System Architecture | To reduce overfitting, we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999) for all training methods.
Introduction | Another possible reason why large training data did not yet show the expected improvements in discriminative SMT is a special overfitting problem of current popular online learning techniques. |
Introduction | Selecting features jointly across shards and averaging does counter the overfitting effect that is inherent to stochastic updating. |
Joint Feature Selection in Distributed Stochastic Learning | Our algorithm 4 (IterSelSGD) introduces feature selection into distributed learning for increased efficiency and as a more radical measure against overfitting.
Introduction | This constraint prevents each model from overfitting to a particular direction and leads to global optimization across alignment directions. |
Training | In addition, an ℓ2 regularization term is added to the objective to prevent the model from overfitting the training data.
Training | The proposed constraint penalizes overfitting to a particular direction and enables two directional models to optimize across alignment directions globally. |
Abbreviator with Nonlocal Information | The first term expresses the conditional log-likelihood of the training data, and the second term represents a regularizer that reduces the overfitting problem in parameter estimation. |
Abbreviator with Nonlocal Information | Since the number of characters in Chinese (more than 10K) is much larger than the number of letters in English (26), in order to avoid a possible overfitting problem, we did not apply these feature templates to Chinese abbreviations.
Experiments | To reduce overfitting, we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999), with the objective function: L(Θ) = Σ_i log P(y_i | x_i, Θ) − ||Θ||²/σ². During training and validation, we set σ = 1 for the DPLVM generators.
Context ordering | By biasing the decision tree learner toward questions that are intuitively of greater utility, we make it less prone to overfitting on small data samples. |
Results | The idea of lowering the specificity of letter class questions as the context length increases is due to Kienappel and Kneser (2001), and is intended to avoid overfitting.
Results | Our expectation was that context ordering would be particularly helpful during the early rounds of active learning, when there is a greater risk of overfitting on the small training sets. |