Index of papers in Proc. ACL that mention
  • overfitting
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Abstract
Furthermore, a new tensor factorization approach is proposed to speed up the model and avoid overfitting.
Conclusion
Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and avoids the risk of overfitting.
Introduction
by the design of features, and the number of features could be so large that the resulting models are too large for practical use and prone to overfit on the training corpus.
Introduction
Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and prevents overfitting.
Introduction
Not only does this approach improve the efficiency of our model, but it also avoids the risk of overfitting.
Max-Margin Tensor Neural Network
Moreover, the additional tensor could bring millions of parameters to the model, which makes the model suffer from the risk of overfitting.
Max-Margin Tensor Neural Network
As long as r is small enough, the factorized tensor operation would be much faster than the un-factorized one and the number of free parameters would also be much smaller, which prevents the model from overfitting.
Related Work
However, given the small size of their tensor matrix, they do not have the problems of high time cost and overfitting that we faced in modeling a sequence labeling task like Chinese word segmentation.
Related Work
That’s why we propose to decrease computational cost and avoid overfitting with tensor factorization.
Related Work
By introducing tensor factorization into the neural network model for sequence labeling tasks, the model training and inference are sped up and overfitting is prevented.
overfitting is mentioned in 10 sentences in this paper.
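The Pei et al. excerpts describe replacing a full parameter tensor with a low-rank factorization, so that far fewer free parameters are trained; this is what speeds the model up and lowers the overfitting risk. A minimal numpy sketch of that general idea (the dimension d and rank r below are illustrative choices, not the paper's MMTNN configuration):

    import numpy as np

    d = 200          # input/hidden dimension (illustrative)
    r = 4            # factorization rank; a small r means few free parameters

    # One full tensor slice: d*d = 40,000 parameters.
    W_full = np.random.randn(d, d) * 0.01

    # Factorized slice: W ~ P @ Q with P (d x r) and Q (r x d): 2*d*r = 1,600 parameters.
    P = np.random.randn(d, r) * 0.01
    Q = np.random.randn(r, d) * 0.01

    x = np.random.randn(d)

    # Un-factorized bilinear score x^T W x: O(d^2) multiplications.
    score_full = x @ W_full @ x

    # Factorized score (x^T P)(Q x): O(d*r) multiplications.
    score_fact = (x @ P) @ (Q @ x)

    print(score_full, score_fact)

With these sizes the factorized slice carries 1,600 parameters instead of 40,000, which is the source of both the speedup and the reduced overfitting risk mentioned above.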
Mylonakis, Markos and Sima'an, Khalil
Conclusions
We address overfitting issues by cross-validating climbing the likelihood of the training data and propose solutions to increase the efficiency and accuracy of decoding.
Introduction
Estimating such grammars under a Maximum Likelihood criterion is known to be plagued by strong overfitting leading to degenerate estimates (DeNero et al., 2006).
Introduction
In contrast, our learning objective not only avoids overfitting the training data but, most importantly, learns joint stochastic synchronous grammars which directly aim at generalisation towards yet unseen instances.
Learning Translation Structure
On the other hand, estimating the parameters under Maximum-Likelihood Estimation (MLE) for the latent translation structure model is bound to overfit towards memorising whole sentence-pairs as discussed in (Mylonakis and Sima’an, 2010), with the resulting grammar estimate not being able to
Learning Translation Structure
However, apart from overfitting towards long phrase-pairs, a grammar with millions of structural rules is also liable to overfit towards degenerate latent structures which, while fitting the training data well, have limited applicability to unseen sentences.
Learning Translation Structure
The CV-criterion, apart from avoiding overfitting, results in discarding the structural rules which are only found in a single part of the training corpus, leading to a more compact grammar while still retaining millions of structural rules that are more hopeful to generalise.
Related Work
We show that a translation system based on such a joint model can perform competitively in comparison with conditional probability models, when it is augmented with a rich latent hierarchical structure trained adequately to avoid overfitting.
Related Work
Cohn and Blunsom (2009) sample rules of the form proposed in (Galley et al., 2004) from a Bayesian model, employing Dirichlet Process priors favouring smaller rules to avoid overfitting.
overfitting is mentioned in 8 sentences in this paper.
Zhang, Hao and Quirk, Chris and Moore, Robert C. and Gildea, Daniel
Conclusion
On top of these hard constraints, the sparse prior of VB helps make the model less prone to overfitting to infrequent phrase pairs, and thus improves the quality of the phrase pairs the model learns.
Experiments
Using EM, because of overfitting, AER drops first and increases again as the number of iterations varies from 1 to 10.
Experiments
The gain is especially large on the test data set, indicating VB is less prone to overfitting.
Introduction
In this direction, Expectation Maximization at the phrase level was proposed by Marcu and Wong (2002), who, however, experienced two major difficulties: computational complexity and controlling overfitting .
Introduction
Computational complexity arises from the exponentially large number of decompositions of a sentence pair into phrase pairs; overfitting is a problem because as EM attempts to maximize the likelihood of its training data, it prefers to directly explain a sentence pair with a single phrase pair.
Introduction
We address the tendency of EM to overfit by using Bayesian methods, where sparse priors assign greater mass to parameter vectors with fewer nonzero values, therefore favoring shorter, more frequent phrases.
Variational Bayes for ITG
If we do not put any constraint on the distribution of phrases, EM overfits the data by memorizing every sentence pair.
overfitting is mentioned in 7 sentences in this paper.
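The excerpts above contrast EM, which maximizes training likelihood and tends to memorize whole sentence pairs, with Variational Bayes under a sparse prior that down-weights rare phrase pairs. Below is the standard mean-field VB update for multinomial parameters with a symmetric Dirichlet prior, assuming scipy is available; it is the generic formula, not the paper's full ITG training loop:

    import numpy as np
    from scipy.special import digamma

    def em_update(expected_counts):
        # Ordinary M-step: relative frequencies (maximum likelihood).
        c = np.asarray(expected_counts, dtype=float)
        return c / c.sum()

    def vb_update(expected_counts, alpha=1e-2):
        # VB M-step with a symmetric Dirichlet(alpha) prior; the result is
        # deliberately sub-normalized (it need not sum to one).
        c = np.asarray(expected_counts, dtype=float)
        K = len(c)
        return np.exp(digamma(c + alpha) - digamma(c.sum() + K * alpha))

    counts = [50.0, 3.0, 0.2]    # made-up expected counts for three phrase pairs
    print(em_update(counts))     # the rare pair keeps its full relative frequency
    print(vb_update(counts))     # the rare pair is discounted far more heavily

With alpha < 1 the digamma transform penalizes small counts much more than large ones, which is the sparsifying effect the excerpts describe.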
Yamangil, Elif and Shieber, Stuart M.
Conclusion
Our investigation with variational Bayes showed that the improvement is due both to finding sparse grammars (mitigating overfitting) and to searching over the space of all grammars (mitigating narrowness).
Evaluation
EM gives a strong baseline since it already uses rules that are limited in depth and number of frontier nodes by stipulation, helping with the overfitting we have mentioned, surprisingly outperforming its discriminative counterpart in both precision and recall (and consequently RelF1).
Evaluation
We conclude that the mitigation of the two factors (narrowness and overfitting) both contribute to the performance gain of GS.
Introduction
In summary, previous methods suffer from problems of narrowness of search, having to restrict the space of possible rules, and overfitting in preferring overly specific grammars.
Introduction
We pursue the use of hierarchical probabilistic models incorporating sparse priors to simultaneously solve both the narrowness and overfitting problems.
Introduction
Segmentation is achieved by introducing a prior bias towards grammars that are compact representations of the data, namely by enforcing simplicity and sparsity: preferring simple rules (smaller segments) unless the use of a complex rule is evidenced by the data (through repetition), and thus mitigating the overfitting problem.
The STSG Model
(Eisner, 2003) However, as noted earlier, EM is subject to the narrowness and overfitting problems.
overfitting is mentioned in 7 sentences in this paper.
Wang, Chang and Fan, James
Experiments
the approaches that completely depend on the labeled data are likely to run into overfitting.
Experiments
Linear SVM performed better than the other two, since the large-margin constraint together with the linear model constraint can alleviate overfitting .
Introduction
When we build a naive model to detect relations, the model tends to overfit to the labeled data.
Relation Extraction with Manifold Models
Integration of the unlabeled data can help solve overfitting problems when the labeled data is not sufficient.
Relation Extraction with Manifold Models
The second term is useful to bound the mapping function f and prevents overfitting from happening.
Relation Extraction with Manifold Models
The algorithm exploits unlabeled data, which helps prevent “overfitting” from happening.
overfitting is mentioned in 6 sentences in this paper.
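The excerpts name two ingredients of the manifold model: a norm penalty that bounds the mapping function f and a smoothness term that lets unlabeled data constrain the model. A generic manifold-regularization objective combining both is sketched below with made-up data; it illustrates the idea only and is not the paper's relation-extraction model:

    import numpy as np

    def manifold_objective(w, X_lab, y_lab, X_all, W_sim, lam=0.1, gamma=0.1):
        # term 1: squared loss on the labeled data
        # term 2: ||w||^2, bounding the mapping function to limit overfitting
        # term 3: smoothness over a similarity graph built from labeled AND unlabeled points
        f_all = X_all @ w
        loss = np.sum((X_lab @ w - y_lab) ** 2)
        ridge = lam * np.dot(w, w)
        diff = f_all[:, None] - f_all[None, :]
        smooth = gamma * np.sum(W_sim * diff ** 2)   # sum_ij W_ij (f_i - f_j)^2
        return loss + ridge + smooth

    # Tiny example: 2 labeled points, 3 unlabeled points, 2 features.
    X_lab = np.array([[1.0, 0.0], [0.0, 1.0]])
    y_lab = np.array([1.0, -1.0])
    X_all = np.vstack([X_lab, np.random.randn(3, 2)])
    W_sim = np.exp(-np.square(X_all[:, None, :] - X_all[None, :, :]).sum(-1))  # RBF similarities
    print(manifold_objective(np.zeros(2), X_lab, y_lab, X_all, W_sim))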
Hoffmann, Raphael and Zhang, Congle and Weld, Daniel S.
Experiments
Without cross-training we observe a reduction in performance, due to overfitting.
Extraction with Lexicons
However, there is a danger of overfitting, which we discuss in Section 4.2.4.
Extraction with Lexicons
4.2.4 Preventing Lexicon Overfitting
Extraction with Lexicons
If we now train the CRF on the same examples that generated the lexicon features, then the CRF will likely overfit, and weight the lexicon features too highly!
Related Work
Crucial to LUCHS’s different setting is also the need to avoid overfitting.
overfitting is mentioned in 6 sentences in this paper.
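The lexicon excerpts describe the failure mode (training the CRF on the same examples that generated its lexicon features) and the remedy (cross-training). The sketch below shows only the splitting pattern: every example gets lexicon features built from the other folds, so the learner cannot simply memorize its own annotations. The helpers build_lexicon and featurize are hypothetical placeholders, not LUCHS functions:

    import random

    def cross_train_lexicon_features(examples, build_lexicon, featurize, k=2):
        # Lexicon features for an example come from a lexicon built WITHOUT that example.
        random.shuffle(examples)
        folds = [examples[i::k] for i in range(k)]
        featured = []
        for i, fold in enumerate(folds):
            others = [ex for j, f in enumerate(folds) if j != i for ex in f]
            lexicon = build_lexicon(others)              # built from the other folds only
            featured += [featurize(ex, lexicon) for ex in fold]
        return featured

    # Toy demo with dummy components.
    examples = [("Alan Turing", "PERSON"), ("Seattle", "LOC"),
                ("Grace Hopper", "PERSON"), ("Paris", "LOC")]
    build_lexicon = lambda exs: {text for text, label in exs if label == "PERSON"}
    featurize = lambda ex, lex: (ex, ex[0] in lex)       # feature: appears in the person lexicon
    print(cross_train_lexicon_features(list(examples), build_lexicon, featurize))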
Saha, Sujan Kumar and Mitra, Pabitra and Sarkar, Sudeshna
Abstract
These methods tend to overfit when the available training corpus is limited, especially if the number of features is large or the number of values for a feature is large.
Conclusion
This is probably due to reduction of overfitting .
Introduction
In an effort to reduce overfitting , they use a combination of a Gaussian prior and early-stopping.
Introduction
This is due to overfitting, which is a serious problem in most NLP tasks in resource-poor languages where annotated data is scarce.
Maximum Entropy Based Model for Hindi NER
From the above discussion it is clear that the system suffers from overfitting if a large number of features are used to train the system.
overfitting is mentioned in 5 sentences in this paper.
Wang, William Yang and Hua, Zhenhao
Copula Models for Text Regression
On the other hand, once such assumptions are removed, another problem arises — they might be prone to errors, and suffer from the overfitting issue.
Copula Models for Text Regression
Therefore, coping with the tradeoff between expressiveness and overfitting seems to be rather important in statistical approaches that capture stochastic dependency.
Copula Models for Text Regression
This is of crucial importance to modeling text data: instead of using the classic bag-of-words representation that uses raw counts, we are now working with uniform marginal CDFs, which helps cope with the overfitting issue due to noise and data sparsity.
Discussions
The second issue is about overfitting .
Experiments
On the pre-2009 dataset, we see that the linear regression and linear SVM perform reasonably well, but the Gaussian kernel SVM performs less well, probably due to overfitting .
overfitting is mentioned in 5 sentences in this paper.
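The copula excerpts turn on one preprocessing step: replacing raw counts with uniform marginal CDFs. A tiny sketch of that probability-integral transform using empirical CDF ranks is given below; it is a generic version of the idea, not the paper's semiparametric Gaussian copula estimator:

    import numpy as np

    def empirical_cdf_transform(column):
        # Map raw feature values to empirical CDF values strictly inside (0, 1).
        column = np.asarray(column, dtype=float)
        n = len(column)
        ranks = np.argsort(np.argsort(column)) + 1   # ranks 1..n (ties broken arbitrarily)
        return ranks / (n + 1.0)

    raw_counts = [0, 3, 1, 7, 3, 0, 12]              # a made-up term-frequency column
    print(empirical_cdf_transform(raw_counts))

Whatever the scale or noise level of the raw counts, the transformed column has approximately uniform marginals, which is what the excerpt credits with easing overfitting due to noise and data sparsity.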
Lampos, Vasileios and Preoţiuc-Pietro, Daniel and Cohn, Trevor
Experiments
Notice that there is a large performance improvement after the first step (which alone is a linear solver), but overfitting occurs after step 11.
Experiments
This might be a result of overfitting the model to a single response variable which usually has a smooth behaviour.
Experiments
On the contrary, the multitask learning property of BGL reduces this type of overfitting, providing more statistical evidence for the terms and users and thus yielding not only a better inference performance, but also a more accurate model.
Methods
Although flexible, this approach would be doomed to failure due to the sheer size of the resulting feature set, and the propensity to overfit all but the largest of training sets.
Methods
The ℓ1-norm regularisation has found many applications in several scientific fields as it encourages sparse solutions which reduce the possibility of overfitting and enhance the interpretability of the inferred model (Hastie et al., 2009).
overfitting is mentioned in 5 sentences in this paper.
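The last excerpt credits ℓ1-norm regularisation with sparse, more interpretable models that overfit less. The self-contained sketch below shows where the sparsity comes from: the soft-thresholding step of a plain ISTA solver drives most weights exactly to zero. It is a generic lasso solver on synthetic data, not the paper's multitask BGL model:

    import numpy as np

    def soft_threshold(w, t):
        # Proximal operator of the l1 norm: shrinks small weights exactly to zero.
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    def lasso_ista(X, y, lam=0.1, steps=500):
        # ISTA for 0.5*||Xw - y||^2 + lam*||w||_1.
        w = np.zeros(X.shape[1])
        step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
        for _ in range(steps):
            grad = X.T @ (X @ w - y)
            w = soft_threshold(w - step * grad, step * lam)
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 20))
    w_true = np.zeros(20)
    w_true[:3] = [2.0, -1.5, 1.0]                    # only 3 informative features
    y = X @ w_true + 0.1 * rng.standard_normal(50)
    w_hat = lasso_ista(X, y, lam=2.0)
    print(np.count_nonzero(w_hat))                   # most of the 20 weights end up exactly zero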
Zhang, Hui and Chiang, David
Introduction
In speech and language processing, smoothing is essential to reduce overfitting , and Kneser-Ney (KN) smoothing (Kneser and Ney, 1995; Chen and Goodman, 1999) has consistently proven to be among the best-performing and most widely used methods.
Word Alignment
It also contains most of the model’s parameters and is where overfitting occurs most.
Word Alignment
However, MLE is prone to overfitting , one symptom of which is the “garbage collection” phenomenon where a rare English word is wrongly aligned to many French words.
Word Alignment
To reduce overfitting , we use expected KN smoothing during the M step.
overfitting is mentioned in 4 sentences in this paper.
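The excerpts lean on Kneser-Ney smoothing as the overfitting remedy; the paper's contribution is applying it to expected (fractional) counts inside EM. The sketch below is only standard interpolated Kneser-Ney for bigrams with integer counts and a fixed discount, to make the discount-plus-continuation structure concrete:

    from collections import Counter

    def interpolated_kn_bigram(tokens, D=0.75):
        # Standard interpolated KN for bigrams; the expected-count variant is not shown.
        bigrams = list(zip(tokens, tokens[1:]))
        c_bi = Counter(bigrams)
        c_hist = Counter(tokens[:-1])                      # history counts c(u)
        cont = Counter(w for (_, w) in set(bigrams))       # N1+(. w): distinct left contexts of w
        followers = Counter(u for (u, _) in set(bigrams))  # N1+(u .): distinct continuations of u
        n_types = len(set(bigrams))

        def p(w, u):
            p_cont = cont[w] / n_types                     # continuation probability
            if c_hist[u] == 0:
                return p_cont
            discounted = max(c_bi[(u, w)] - D, 0.0) / c_hist[u]
            backoff_weight = D * followers[u] / c_hist[u]
            return discounted + backoff_weight * p_cont

        return p

    p = interpolated_kn_bigram("the cat sat on the mat the cat ate".split())
    print(p("cat", "the"), p("sat", "the"))   # a frequent and an unseen-in-this-context word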
Sarioglu, Efsun and Yadav, Kabir and Choi, Hyeong-Ah
Background
PLSA solves the polysemy problem; however, it is not considered a fully generative model of documents and it is known to overfit (Blei et al., 2003).
Background
LDA performs better than PLSA for small datasets since it avoids overfitting and it supports polysemy (Blei et al., 2003).
Experiments
LDA was chosen to generate the topic models of clinical reports due to its being a generative probabilistic system for documents and its robustness to overfitting.
Experiments
SVM was chosen as the classification algorithm as it was shown that it performs well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and it is robust to overfitting (Sebastiani, 2002).
overfitting is mentioned in 4 sentences in this paper.
Vaswani, Ashish and Huang, Liang and Chiang, David
Conclusion
We have extended the IBM models and HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting, and, in particular, the “garbage collection” effect.
Method
Maximum likelihood training is prone to overfitting , especially in models with many parameters.
Method
In word alignment, one well-known manifestation of overfitting is that rare words can act as “garbage collectors”
Method
We have previously proposed another simple remedy to overfitting in the context of unsupervised part-of-speech tagging (Vaswani et al., 2010), which is to minimize the size of the model using a smoothed ℓ0 prior.
overfitting is mentioned in 4 sentences in this paper.
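The excerpts attribute the compact translation tables to a smoothed ℓ0 prior. Below is one common smooth surrogate for the ℓ0 count, Σ_i (1 − exp(−θ_i/β)), evaluated on two made-up translation distributions; the paper's exact penalty and its optimization are not reproduced here:

    import numpy as np

    def smoothed_l0(theta, beta=0.05):
        # Smooth surrogate for the number of nonzero entries of a nonnegative
        # parameter vector; as beta -> 0 it approaches the exact count.
        theta = np.asarray(theta, dtype=float)
        return np.sum(1.0 - np.exp(-theta / beta))

    t_dense = np.array([0.25, 0.25, 0.25, 0.25])    # mass spread over 4 translations
    t_sparse = np.array([0.97, 0.01, 0.01, 0.01])   # mass concentrated on 1 translation
    print(smoothed_l0(t_dense), smoothed_l0(t_sparse))

The sparse table gets a markedly smaller penalty, so adding such a term to the training objective pushes the aligner toward compact word-to-word translation tables and away from the garbage-collection behaviour described above.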
Croce, Danilo and Giannone, Cristina and Annesi, Paolo and Basili, Roberto
A Distributional Model for Argument Classification
First, we propose a model that does not depend on complex syntactic information in order to minimize the risk of overfitting .
Abstract
The resulting argument classification model promotes a simpler feature space that limits the potential overfitting effects.
Introduction
Notice how this is also a general problem of statistical learning processes, as large fine-grained feature sets are more exposed to the risks of overfitting.
Related Work
While these approaches increase the expressive power of the models to capture more general linguistic properties, they rely on complex feature sets, are more demanding about the amount of training information and increase the overall exposure to overfitting effects.
overfitting is mentioned in 4 sentences in this paper.
Sun, Xu and Wang, Houfeng and Li, Wenjie
System Architecture
The second term is a regularizer for reducing overfitting .
System Architecture
To avoid overfitting , we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
System Architecture
To reduce overfitting , we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999) for all training methods.
overfitting is mentioned in 3 sentences in this paper.
Simianer, Patrick and Riezler, Stefan and Dyer, Chris
Introduction
Another possible reason why large training data did not yet show the expected improvements in discriminative SMT is a special overfitting problem of current popular online learning techniques.
Introduction
Selecting features jointly across shards and averaging does counter the overfitting effect that is inherent to stochastic updating.
Joint Feature Selection in Distributed Stochastic Learning
Our algorithm 4 (IterSelSGD) introduces feature selection into distributed learning for increased efficiency and as a more radical measure against overfitting .
overfitting is mentioned in 3 sentences in this paper.
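The excerpts describe selecting features jointly across shards and then averaging as the antidote to the overfitting inherent in stochastic updating. The sketch below shows one plausible reading of that step: score each feature by the ℓ2 norm of its shard-local weights, keep the strongest k, and average only those. It is illustrative only, not the paper's Algorithm 4 (IterSelSGD):

    import numpy as np

    def select_and_average(shard_weights, k=100):
        # shard_weights: one weight vector per shard-local model.
        W = np.vstack(shard_weights)              # shape: (num_shards, num_features)
        scores = np.linalg.norm(W, axis=0)        # per-feature l2 norm across shards
        keep = np.argsort(scores)[-k:]            # indices of the k strongest features
        averaged = np.zeros(W.shape[1])
        averaged[keep] = W[:, keep].mean(axis=0)  # average only the selected features
        return averaged

    rng = np.random.default_rng(0)
    shards = [rng.standard_normal(1000) * (rng.random(1000) < 0.05) for _ in range(4)]
    w = select_and_average(shards, k=50)
    print(np.count_nonzero(w))                    # at most 50 features survive selection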
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro
Introduction
This constraint prevents each model from overfitting to a particular direction and leads to global optimization across alignment directions.
Training
In addition, an ℓ2 regularization term is added to the objective to prevent the model from overfitting the training data.
Training
The proposed constraint penalizes overfitting to a particular direction and enables two directional models to optimize across alignment directions globally.
overfitting is mentioned in 3 sentences in this paper.
Sun, Xu and Okazaki, Naoaki and Tsujii, Jun'ichi
Abbreviator with Nonlocal Information
The first term expresses the conditional log-likelihood of the training data, and the second term represents a regularizer that reduces the overfitting problem in parameter estimation.
Abbreviator with Nonlocal Information
Since the number of letters in Chinese (more than 10K characters) is much larger than the number of letters in English (26 letters), in order to avoid a possible overfitting problem, we did not apply these feature templates to Chinese abbreviations.
Experiments
To reduce overfitting, we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999), with the objective function L(Θ) = Σ_i log P(y_i | x_i, Θ) − ||Θ||²/σ². During training and validation, we set σ = 1 for the DPLVM generators.
overfitting is mentioned in 3 sentences in this paper.
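The objective quoted above has the shape Σ_i log P(y_i | x_i, Θ) − ||Θ||²/σ². The sketch below evaluates an objective of exactly that shape for plain binary logistic regression on synthetic data (the paper's DPLVM is not reproduced); σ = 1 matches the setting in the excerpt:

    import numpy as np

    def l2_regularized_loglik(Theta, X, y, sigma=1.0):
        # sum_i log P(y_i | x_i, Theta)  -  ||Theta||^2 / sigma^2
        z = X @ Theta
        log_p = np.where(y == 1, -np.log1p(np.exp(-z)), -np.log1p(np.exp(z)))
        return np.sum(log_p) - np.dot(Theta, Theta) / sigma ** 2

    rng = np.random.default_rng(1)
    X = rng.standard_normal((30, 5))
    y = (X[:, 0] > 0).astype(int)
    print(l2_regularized_loglik(np.zeros(5), X, y, sigma=1.0))

Maximizing the first term alone is plain maximum likelihood; the second term penalizes large weights, which is the overfitting control the three excerpts refer to.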
Dwyer, Kenneth and Kondrak, Grzegorz
Context ordering
By biasing the decision tree learner toward questions that are intuitively of greater utility, we make it less prone to overfitting on small data samples.
Results
The idea of lowering the specificity of letter class questions as the context length increases is due to Kienappel and Kneser (2001), and is intended to avoid overfitting.
Results
Our expectation was that context ordering would be particularly helpful during the early rounds of active learning, when there is a greater risk of overfitting on the small training sets.
overfitting is mentioned in 3 sentences in this paper.