Abstract | Finding the optimal model parameters is then usually a difficult nonconvex optimization problem.
Abstract | We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. |
Introduction | The node branches on a single model parameter θ_m to partition its subspace.
Introduction | A variety of ways to find better local optima have been explored, including heuristic initialization of the model parameters (Spitkovsky et al., 2010a), random restarts (Smith, 2006), and annealing (Smith and Eisner, 2006; Smith, 2006). |
Introduction | search with certificates of ε-optimality for both the corpus parse and the model parameters.
The Constrained Optimization Task | The nonlinear constraints ensure that the model parameters are true log-probabilities. |
The Constrained Optimization Task | Notation (table fragment): feature / model parameter index; sentence index; conditional distribution index; number of model parameters.
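The Constrained Optimization Task | As a hedged sketch of those nonlinear constraints (writing θ_m for the model parameters and c for a conditional distribution, as in the notation above; the paper's exact indexing may differ), each conditional distribution's probabilities must exponentiate and sum to one while every parameter stays non-positive:
\[
\sum_{m \in c} \exp(\theta_m) = 1 \quad \text{for each conditional distribution } c, \qquad \theta_m \le 0 .
\]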
Experimental results | For the 44 and 230 million token corpora, all sentences are automatically parsed and used to initialize the model parameters, while for the 1.3 billion token corpus, we parse the sentences from a portion of the corpus that contains 230 million tokens, then use them to initialize the model parameters.
Experimental results | Nevertheless, experimental results show that this approach is effective in providing initial values for the model parameters.
Training algorithm | The objective of maximum likelihood estimation is to maximize the likelihood L(D, p) with respect to the model parameters p.
Training algorithm | and denote T_N as the collection of N-best parse trees for the sentences over the entire corpus D under model parameter p.
Training algorithm | mate model parameters.
Introduction | From these corpora, we estimate translation model parameters: word-to-word translation tables, fertilities, distortion parameters, phrase tables, syntactic transformations, etc.
Introduction | A language model P(e) is typically used in SMT decoding (Koehn, 2009), but here P(e) actually plays a central role in training the translation model parameters.
Machine Translation as a Decipherment Task | During decipherment training, our objective is to estimate the model parameters θ in order to maximize the probability of the foreign corpus f. From Equation 4 we have:
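Machine Translation as a Decipherment Task | (The equation referenced here did not survive extraction. As a hedged reconstruction of the usual decipherment objective, with e ranging over candidate source-language sentences and P(e) the language model mentioned elsewhere, it takes a form like:)
\[
\hat{\theta} = \arg\max_{\theta} \; \prod_{f} \sum_{e} P(e)\, P_{\theta}(f \mid e) .
\]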
Machine Translation as a Decipherment Task | For Bayesian MT decipherment, we set a high prior value on the language model (10^4) and use sparse priors for the IBM 3 model parameters t, n, d, p (0.01, 0.01, 0.01, 0.01).
Word Substitution Decipherment | During decipherment, our goal is to estimate the channel model parameters θ.
Word Substitution Decipherment | These methods are attractive for their ability to manage uncertainty about model parameters and allow one to incorporate prior knowledge during inference. |
Base Models | be the value of feature i for subtree r over sentence s, and let E_θ[f_i|s] be the expected value of feature i in sentence s, based on the current model parameters θ.
Hierarchical Joint Learning | After training has been completed, we retain only the joint model’s parameters.
Hierarchical Joint Learning | The first summation in this equation computes the log-likelihood of each model, using the data and parameters which correspond to that model, and the prior likelihood of that model’s parameters, based on a Gaussian prior centered around the top-level, non-model-specific parameters θ*, and with model-specific variance σ_m.
Hierarchical Joint Learning | We need to compute partial derivatives in order to optimize the model parameters.
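Hierarchical Joint Learning | As a sketch of those partial derivatives under the Gaussian prior described above (writing O for the full objective, L_m for model m's data log-likelihood, and reusing θ_m, θ*, and σ_m from the previous sentences; the paper's exact objective may also place a prior on θ*), each model-specific parameter receives its own likelihood gradient plus a pull toward the shared top-level parameters:
\[
\frac{\partial \mathcal{O}}{\partial \theta_{m,i}} = \frac{\partial \mathcal{L}_m}{\partial \theta_{m,i}} - \frac{\theta_{m,i} - \theta^*_i}{\sigma_m^2}, \qquad
\frac{\partial \mathcal{O}}{\partial \theta^*_i} = \sum_{m} \frac{\theta_{m,i} - \theta^*_i}{\sigma_m^2} .
\]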
In our dataset only 11% of Candidate Relations are valid. | Initialization: Model parameters θ_x = 0 and θ_c = 0.
In our dataset only 11% of Candidate Relations are valid. | As before, θ_x is the vector of model parameters, and φ_x is the feature function.
In our dataset only 11% of Candidate Relations are valid. | Therefore, during learning, we need to find the model parameters that maximize expected future reward (Sutton and Barto, 1998). |
Model | Update the model parameters, using the low-level planner’s success or failure as the source of supervision.
Model | where θ_c is the vector of model parameters.
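Model | A minimal sketch of this kind of reward-driven update, assuming a log-linear policy over candidate actions and a scalar reward derived from the low-level planner's success or failure (all function and variable names here are illustrative, not the paper's):
```python
import numpy as np

def policy_probs(theta, feature_matrix):
    """Log-linear policy over candidate actions; feature_matrix has one
    row of features per candidate action."""
    scores = feature_matrix @ theta
    scores -= scores.max()                     # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def reward_update(theta, feature_matrix, chosen, reward, lr=0.1):
    """One policy-gradient-style step: make the chosen action more likely
    when the reward (e.g. planner success = +1, failure = -1) is positive."""
    probs = policy_probs(theta, feature_matrix)
    expected_features = probs @ feature_matrix
    gradient = reward * (feature_matrix[chosen] - expected_features)
    return theta + lr * gradient
```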
Models considered 2.1 Basic Conditional Random Fields | The model parameters, then, form the parameters of the leaves of this hierarchy.
Models considered 2.1 Basic Conditional Random Fields | (3) represent the likelihood of the data in each domain given their corresponding model parameters; the second line represents the likelihood of each model parameter in each domain given the hyper-parameter of its parent in the tree hierarchy of features; and the last term goes over the entire tree T except the leaf nodes.
Models considered 2.1 Basic Conditional Random Fields | We perform MAP estimation for each model parameter as well as the hyper-parameters.
Detailed generative story | This is a conditional log-linear model parameterized by φ, where φ_k ∼ N(0, σ²).
Overview and Related Work | For learning, we iteratively adjust our model’s parameters to better explain our samples. |
Overview and Related Work | (2012) we use topics as the contexts, but learn mention topics jointly with other model parameters.
Parameter Estimation | E-step: Collect samples by MCMC simulation as in §5, given the current model parameters θ and φ.
Model | We assume the generative model operates by first generating the model parameters from a set of Dirichlet distributions. |
Model | • Generating Model Parameters: For every pair of feature type f and phrase tag z, draw a multinomial distribution parameter θ^f_z from a Dirichlet prior P(θ^f_z).
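Model | A minimal sketch of that first generative step, assuming symmetric Dirichlet priors and illustrative feature types and phrase tags (none of these names or sizes come from the paper):
```python
import numpy as np

rng = np.random.default_rng(0)

feature_types = ["word", "pos"]            # illustrative feature types f
phrase_tags = ["NP", "VP", "PP"]           # illustrative phrase tags z
num_outcomes = {"word": 1000, "pos": 45}   # outcomes per feature type
alpha = 0.1                                # symmetric Dirichlet hyperparameter

# For every (feature type f, phrase tag z) pair, draw a multinomial
# parameter vector theta[f][z] from a Dirichlet prior.
theta = {
    f: {z: rng.dirichlet(alpha * np.ones(num_outcomes[f])) for z in phrase_tags}
    for f in feature_types
}
```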
Model | Learning the Model: During inference, we want to estimate the hidden specification trees t given the observed natural language specifications w, after integrating the model parameters out, i.e.
Robust perceptron learning | Antagonistic adversaries choose transformations informed by the current model parameters w, but random adversaries randomly select transformations from a predefined set of possible transformations, e.g. |
Robust perceptron learning | In an online setting, feature bagging can be modelled as a game between a learner and an adversary, in which (a) the adversary can only choose between deleting transformations, (b) the adversary cannot see the model parameters when choosing a transformation, and (c) the adversary only moves in between passes over the data.
Robust perceptron learning | LRA is an adversarial game in which the two players are unaware of each other’s current move and, in particular, where the adversary does not see the model parameters and only randomly corrupts the data points.
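Robust perceptron learning | A minimal sketch of such a random-adversary game, respecting constraints (a)-(c) above: the adversary only deletes features, cannot see the weights, and moves only between passes (the binary-classification setup and all names are illustrative, not the paper's):
```python
import random

def random_adversary_perceptron(data, num_feats, num_passes=5,
                                deletions_per_pass=10, seed=0):
    """Perceptron with a random adversary: before each pass the adversary
    deletes a random subset of feature indices without looking at the
    current weights, and never moves mid-pass."""
    rng = random.Random(seed)
    w = [0.0] * num_feats
    for _ in range(num_passes):
        deleted = set(rng.sample(range(num_feats),
                                 min(deletions_per_pass, num_feats)))
        for feats, label in data:          # feats: {index: value}, label: +1 or -1
            score = sum(v * w[i] for i, v in feats.items() if i not in deleted)
            if label * score <= 0:         # mistake-driven update on surviving features
                for i, v in feats.items():
                    if i not in deleted:
                        w[i] += label * v
    return w
```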
Introduction | For example, in Expectation-Maximization (Dempster et al., 1977), the Expectation (E) step computes the posterior distribution over possible completions of the data, and the Maximization (M) step reestimates the model parameters as |
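Introduction | (The re-estimation formula is missing from the extract. For a generic multinomial parameterization, the M-step sets each parameter to a normalized expected count gathered in the E-step, a hedged sketch of which is:)
\[
\theta^{(t+1)}(y \mid x) = \frac{\mathbb{E}_{\theta^{(t)}}\!\left[\operatorname{count}(x, y)\right]}{\sum_{y'} \mathbb{E}_{\theta^{(t)}}\!\left[\operatorname{count}(x, y')\right]} .
\]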
Word Alignment | Different models parameterize this probability distribution in different ways. |
Word Alignment | It also contains most of the model’s parameters and is where overfitting occurs most. |
Experiments | This can explain the success of response-based learning: Lexical and structural variants of reference translations can be used to boost model parameters towards translations with positive feedback, while the same translations might be considered as negative examples in standard structured learning. |
Introduction | Here, learning proceeds by “trying out” translation hypotheses, receiving a response from interacting in the task, and converting this response into a supervision signal for updating the model parameters.
Response-based Online Learning | (2010) or Goldwasser and Roth (2013) describe a response-driven learning framework for the area of semantic parsing: Here, a meaning representation is “tried out” by iteratively generating system outputs, receiving feedback from world interaction, and updating the model parameters.
Model Training | gym] is the plausible score for the best translation candidate given the model parameters W and V.
Phrase Pair Embedding | Table 1: The relationship between the size of training data and the number of model parameters.
Phrase Pair Embedding | Table 1 shows the relationship between the size of training data and the number of model parameters.
Experimental Setup | We should note that since our model parameter A is represented and learned in the low-rank form, we only have to store and maintain the low-rank projections Uφ_h, Vφ_m, and Wφ_{h,m} rather than explicitly calculate the feature tensor φ_h ⊗ φ_m ⊗ φ_{h,m}.
Problem Formulation | We will directly learn a low-rank tensor A (because r is small) in this form as one of our model parameters.
Problem Formulation | where θ ∈ ℝ^L, U ∈ ℝ^{r×n}, V ∈ ℝ^{r×n}, and W ∈ ℝ^{r×d} are the model parameters to be learned.
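Problem Formulation | As a sketch of how the low-rank parameters are used (φ_h, φ_m, and φ_{h,m} denote the head, modifier, and arc feature vectors mentioned elsewhere in this section; the paper's full scoring function may include additional terms), the tensor score decomposes into r rank-one components:
\[
s_{\text{tensor}}(h, m) \;=\; \sum_{i=1}^{r} \big[U\phi_h\big]_i \,\big[V\phi_m\big]_i \,\big[W\phi_{h,m}\big]_i .
\]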
Approaches | Unsupervised Grammar Induction: Our first method for grammar induction is fully unsupervised Viterbi EM training of the Dependency Model with Valence (DMV) (Klein and Manning, 2004), with uniform initialization of the model parameters.
Related Work | (2010a) show that Viterbi (hard) EM training of the DMV with simple uniform initialization of the model parameters yields higher accuracy models than standard soft-EM |
Related Work | In Viterbi EM, the E-step finds the maximum likelihood corpus parse given the current model parameters.
Decipherment Model for Machine Translation | During decipherment training, our objective is to estimate the model parameters in order to maximize the probability of the source text f as suggested by Ravi and Knight (2011b). |
Decipherment Model for Machine Translation | Instead, we propose a new Bayesian inference framework to estimate the translation model parameters.
Introduction | The parallel corpora are used to estimate translation model parameters involving word-to-word translation tables, fertilities, distortion, phrase translations, syntactic transformations, etc. |
Generative state tracking | where φ_i(x, y) are feature functions jointly defined on features and labels, and λ_i are the model parameters.
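Generative state tracking | As a sketch of the standard log-linear form these symbols usually parameterize (the normalization over a label set y' is an assumption, not necessarily the paper's exact model):
\[
p_{\lambda}(y \mid x) \;=\; \frac{\exp\!\big(\sum_i \lambda_i \phi_i(x, y)\big)}{\sum_{y'} \exp\!\big(\sum_i \lambda_i \phi_i(x, y')\big)} .
\]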
Generative state tracking | This formulation also decouples the number of model parameters (i.e.
Generative state tracking | Second, model parameters in DISCIND are trained independently of competing hypotheses. |
Model | The Bernoulli parameter of a pixel inside a glyph bounding box depends on the pixel’s location inside the box (as well as on d_i and z_i, but for simplicity of exposition, we temporarily suppress this dependence) and on the model parameters governing glyph shape (for each character type c, the parameter matrix φ_c specifies the shape of the character’s glyph).
Results and Analysis | (2010), we use a regularization term in the optimization of the log-linear model parameters φ_c during the M-step.
Results and Analysis | Figure 8: The central glyph is a representation of the initial model parameters for the glyph shape for g, and surrounding this are the learned parameters for documents from various years. |
Decipherment | These methods are attractive for their ability to manage uncertainty about model parameters and allow one to incorporate prior knowledge during inference. |
Decipherment | Our goal is to estimate the channel model parameters θ in order to maximize the probability of the observed ciphertext c:
Decipherment | The base distribution P0 represents prior knowledge about the model parameter distributions. |
Association Model | Basic Interpolation: This smoothing model, P_interp(e|q), linearly combines our foreground and background models using a model parameter α:
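Association Model | (The interpolation formula itself is not shown in the extract; a hedged sketch, writing P_fg and P_bg as illustrative names for the foreground and background models named in the sentence, is:)
\[
P_{\text{interp}}(e \mid q) \;=\; \alpha\, P_{\text{fg}}(e \mid q) \;+\; (1 - \alpha)\, P_{\text{bg}}(e \mid q) .
\]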
Association Model | Section 5.2 outlines our procedure for learning the model parameters for both P_interp(e|q) and
Experimental Results | 5.2.1 Model Parameters |
Adding Linguistic Knowledge to the Monte-Carlo Framework | Since our model is a nonlinear approximation of the underlying action-value function of the game, we learn model parameters by applying nonlinear regression to the observed final utilities from the simulated roll-outs. |
Adding Linguistic Knowledge to the Monte-Carlo Framework | The resulting update to model parameters θ is of the form:
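Adding Linguistic Knowledge to the Monte-Carlo Framework | (The update formula is missing from the extract. For nonlinear regression of an action-value estimate Q(s, a; θ) toward an observed final utility R with learning rate α, a gradient update of this general shape is:)
\[
\Delta\theta \;=\; \alpha \,\big(R - Q(s, a; \theta)\big)\, \nabla_{\theta} Q(s, a; \theta) .
\]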
Adding Linguistic Knowledge to the Monte-Carlo Framework | We use the same experimental settings across all methods, and all model parameters are initialized to zero. |
A Model of Semantics | We select the model parameters θ by maximizing the marginal likelihood of the data, where the data D is given in the form of groups w =
Empirical Evaluation | When estimating the model parameters, we followed the training regime prescribed by Liang et al. (2009).
Inference with NonContradictory Documents | In the supervised case, where a and m are observable, estimation of the generative model parameters is generally straightforward. |
Challenges for Discriminative SMT | This itself provides robustness to noisy data, in addition to the explicit regularisation from a prior over the model parameters.
Discriminative Synchronous Transduction | Here k ranges over the model’s features, and Λ = {λ_k} are the model parameters (weights for their corresponding features).
Discriminative Synchronous Transduction | Each L-BFGS iteration requires the objective value and its gradient with respect to the model parameters . |
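Discriminative Synchronous Transduction | A self-contained sketch of the value-and-gradient callback such an L-BFGS loop needs (the log-linear objective, array shapes, and names here are illustrative, not the paper's translation model):
```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood_and_grad(lam, feats, labels):
    """Illustrative objective: negative conditional log-likelihood of a
    log-linear model plus an L2 prior, returning (value, gradient) as
    L-BFGS expects.  feats has shape (num_examples, num_labels, num_feats)."""
    scores = feats @ lam                              # (N, K)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_z = np.log(np.exp(scores).sum(axis=1))        # (N,)
    gold = scores[np.arange(len(labels)), labels]
    nll = -(gold - log_z).sum() + 0.5 * lam @ lam
    probs = np.exp(scores - log_z[:, None])           # (N, K)
    expected = np.einsum("nk,nkf->f", probs, feats)   # expected feature counts
    observed = feats[np.arange(len(labels)), labels].sum(axis=0)
    grad = -(observed - expected) + lam
    return nll, grad

# Usage sketch:
# lam0 = np.zeros(num_feats)
# result = minimize(neg_log_likelihood_and_grad, lam0, args=(feats, labels),
#                   method="L-BFGS-B", jac=True)
```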