Gaussian Process Regression | Specifically, we can derive the gradient of the (log) marginal likelihood with respect to the model hyperparameters (i.e., σ, σ_n, θ, etc.).
Gaussian Process Regression | Note that in general the marginal likelihood is non-convex in the hyperparameter values, and consequently the solutions may only be locally optimal. |
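As a concrete illustration of these gradients, the following minimal numpy sketch computes the log marginal likelihood of a GP with an RBF kernel, together with its gradient with respect to the log hyperparameters, via the standard identity d(lml)/dθ = ½ tr((ααᵀ − K⁻¹) ∂K/∂θ) with α = K⁻¹y. The kernel parameterisation and data are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def lml_and_grad(X, y, log_ell, log_sf, log_sn):
    """Log marginal likelihood of a GP with an RBF kernel, and its
    gradient w.r.t. the log hyperparameters: lengthscale ell,
    signal std sf, noise std sn."""
    n = len(y)
    ell, sf2, sn2 = np.exp(log_ell), np.exp(2 * log_sf), np.exp(2 * log_sn)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Kf = sf2 * np.exp(-0.5 * sq / ell ** 2)
    K = Kf + sn2 * np.eye(n)
    Kinv = np.linalg.inv(K)
    alpha = Kinv @ y
    lml = (-0.5 * y @ alpha - 0.5 * np.linalg.slogdet(K)[1]
           - 0.5 * n * np.log(2 * np.pi))
    # d(lml)/d(theta) = 0.5 * tr((alpha alpha^T - K^{-1}) dK/dtheta)
    inner = np.outer(alpha, alpha) - Kinv
    dK = [Kf * sq / ell ** 2,        # dK/d(log ell)
          2.0 * Kf,                  # dK/d(log sf)
          2.0 * sn2 * np.eye(n)]     # dK/d(log sn)
    grad = np.array([0.5 * np.trace(inner @ d) for d in dK])
    return lml, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
print(lml_and_grad(X, y, 0.0, 0.0, -1.0))
```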
Gaussian Process Regression | Here we bootstrap the learning of complex models with many hyperparameters by initialising them with the hyperparameter values learned for simpler models.
Multitask Quality Estimation 4.1 Experimental Setup | GP: All GP models were implemented using the GPML Matlab toolbox. Hyperparameter optimisation was performed using conjugate gradient ascent of the log marginal likelihood function, with up to 100 iterations.
Multitask Quality Estimation 4.1 Experimental Setup | The simpler models were initialised with all hyperparameters set to one, while more complex models were initialised using the learned hyperparameters of the simpler models.
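This setup could be reproduced in spirit (though not with the paper's GPML Matlab code) using scikit-learn and scipy, as in the sketch below; the data, kernel choice, and the all-ones initialisation are illustrative assumptions. Note that scikit-learn stores kernel hyperparameters in log space, so "all hyperparameters set to one" corresponds to a zero vector.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # toy QE feature vectors
y = X @ np.array([0.5, -1.0, 0.2]) + 0.1 * rng.normal(size=50)

kernel = ConstantKernel() * RBF(length_scale=np.ones(3)) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X, y)

def neg_lml(log_theta):
    """Negative log marginal likelihood and its gradient."""
    val, grad = gp.log_marginal_likelihood(log_theta, eval_gradient=True)
    return -val, -grad

# "All hyperparameters set to one": log(1) = 0 in every dimension.
theta0 = np.zeros(len(gp.kernel_.theta))
res = minimize(neg_lml, theta0, jac=True, method="CG",
               options={"maxiter": 100})
print("optimised log-hyperparameters:", res.x)
```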
Bayesian inference for PCFGs | Input: Grammar G, vector of trees t, vector of hyperparameters α, previous parameters θ⁰.
Bayesian inference for PCFGs | Result: A vector of parameters θ. repeat: draw θ from products of Dirichlets with hyperparameters α + f(t)
Bayesian inference for PCFGs | Input: Grammar G, vector of trees t, vector of hyperparameters α, previous rule parameters θ⁰.
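The θ-resampling step of such a sampler is simple to sketch: for each nonterminal, draw its rule probabilities from a Dirichlet whose parameters are the hyperparameters α plus the rule counts f(t) in the current trees. The grammar and counts below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy grammar: rules grouped by left-hand-side nonterminal.
rules = {"S": ["S -> NP VP", "S -> VP"],
         "NP": ["NP -> DT NN", "NP -> NN"]}
alpha = 1.0                              # symmetric Dirichlet hyperparameter
f = {"S -> NP VP": 7, "S -> VP": 3,      # rule counts f(t) from the trees
     "NP -> DT NN": 5, "NP -> NN": 5}

# One Gibbs step for theta: for each nonterminal, draw its rule
# probabilities from Dirichlet(alpha + f(t)).
theta = {}
for lhs, rs in rules.items():
    draw = rng.dirichlet([alpha + f[r] for r in rs])
    theta.update(zip(rs, draw))
print(theta)
```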
Analysis | Next we examine the transition Dirichlet hyperparameters learned by our model. |
Analysis | As we can see, the learned hyperparameters yield highly asymmetric priors over transition distributions. |
Analysis | Figure 4 shows MAP transition Dirichlet hyperparameters of the CLUST model, when trained |
Experiments | The simplest version, SYMM, disregards all information from other languages, using simple symmetric hyperparameters on the transition and emission Dirichlet priors (all hyperparameters set to 1). |
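The contrast between the symmetric all-ones prior and a learned asymmetric prior is easy to see in a quick sketch (the tagset size and hyperparameter values below are invented for illustration): transition rows drawn from the asymmetric prior concentrate mass on a few successor tags, while rows from the symmetric prior are near-uniform on average.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 5                                          # toy tagset size
symm = np.ones(K)                              # SYMM: all hyperparameters = 1
asymm = np.array([4.0, 0.2, 0.1, 0.1, 0.05])   # illustrative learned values

print(rng.dirichlet(symm, 3).round(2))         # near-uniform rows
print(rng.dirichlet(asymm, 3).round(2))        # rows peaked on a few tags
```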
Inference | The second term is the tag transition predictive distribution given the Dirichlet hyperparameters, yielding a familiar Pólya urn scheme form.
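In that Pólya urn form, the predictive probability of a transition is its smoothed relative frequency: counts plus the Dirichlet hyperparameters, renormalised. A minimal sketch with toy counts:

```python
import numpy as np

def transition_predictive(counts_t, alpha):
    """Predictive probability of each transition t -> t' given counts
    and Dirichlet hyperparameters (the Polya urn form: previously seen
    transitions are reinforced in proportion to their counts)."""
    counts_t = np.asarray(counts_t, float)
    alpha = np.asarray(alpha, float)
    return (counts_t + alpha) / (counts_t.sum() + alpha.sum())

print(transition_predictive([12, 3, 0, 1], np.full(4, 0.5)))
```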
Inference | Finally, we tackle the third term, Equation 7, corresponding to the predictive distribution of emission observations given the Dirichlet hyperparameters.
Inference | To sample the Dirichlet hyperparameter for cluster k and transition t → t′, we need to compute:
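The extracted equation is missing here, but the quantity such samplers evaluate is typically a Dirichlet-multinomial marginal likelihood. The sketch below computes that standard form and uses it inside a Metropolis-Hastings step with a log-normal proposal; the scheme and the flat hyperprior are assumptions, not necessarily the paper's exact sampler.

```python
import numpy as np
from scipy.special import gammaln

def log_dirmult(counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood of one row of
    transition counts under Dirichlet hyperparameters alpha."""
    counts = np.asarray(counts, float)
    alpha = np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(counts.sum() + alpha.sum())
            + (gammaln(counts + alpha) - gammaln(alpha)).sum())

rng = np.random.default_rng(0)
counts = np.array([12.0, 3.0, 0.0, 1.0])   # toy counts for t -> t'
alpha = np.ones(4)

# One MH update of alpha[1] with a multiplicative log-normal proposal;
# log(prop/old) is the Hastings correction for that proposal.
prop = alpha.copy()
prop[1] *= np.exp(0.3 * rng.normal())
log_acc = (log_dirmult(counts, prop) - log_dirmult(counts, alpha)
           + np.log(prop[1] / alpha[1]))
if np.log(rng.random()) < log_acc:
    alpha = prop
```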
Distributional Semantic Hidden Markov Models | We follow the “neutral” setting of hyperparameters given by Ormoneit and Tresp (1995), so that the MAP estimate for the covariance matrix for (event or slot) state i becomes:
Distributional Semantic Hidden Markov Models | where j indexes all the relevant semantic vectors x_j in the training set, r_ij is the posterior responsibility of state i for vector x_j, and β is the remaining hyperparameter that we tune to adjust the amount of regularization.
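The equation itself did not survive extraction; as a sketch of the quantity being described, the following computes a responsibility-weighted scatter matrix regularized toward βI, a common ridge-style form of such MAP estimates (assumed here, not necessarily Ormoneit and Tresp's exact estimator).

```python
import numpy as np

def map_covariance(X, resp_i, beta):
    """Regularized MAP covariance for one state: the responsibility-
    weighted scatter of the semantic vectors x_j, shrunk toward
    beta * I. The exact normalization is an assumption."""
    w = resp_i / resp_i.sum()
    mu = w @ X                               # responsibility-weighted mean
    diff = X - mu
    scatter = (resp_i[:, None] * diff).T @ diff
    return (scatter + beta * np.eye(X.shape[1])) / (resp_i.sum() + 1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # toy semantic vectors x_j
r = rng.random(200)              # posterior responsibilities r_ij
print(map_covariance(X, r, beta=0.1).shape)
```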
Distributional Semantic Hidden Markov Models | We tune the hyperparameters (N_E, N_S, δ, β, k) and the number of EM iterations by twofold cross-validation.
Guided Summarization Slot Induction | We trained a DSHMM separately for each of the five domains with different semantic models, tuning hyperparameters by twofold cross-validation. |
Related Work | Distributions that generate the latent variables and hyperparameters are omitted for clarity. |
Bilingual Infinite Tree Model | Our procedure alternates between sampling each of the following variables: the auxiliary variables u, the state assignments z, the transition probabilities π, the shared DP parameters β, and the hyperparameters α₀ and γ.
Bilingual Infinite Tree Model | α₀ is parameterized by a gamma hyperprior with hyperparameters α_a and α_b.
Bilingual Infinite Tree Model | γ is parameterized by a gamma hyperprior with hyperparameters γ_a and γ_b.
Experiment | In sampling α₀ and γ, the hyperparameters α_a, α_b, γ_a, and γ_b are set to 2, 1, 1, and 1, respectively, which is the same setting as in Gael et al.
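Resampling a DP concentration parameter under such a gamma hyperprior is commonly done with the auxiliary-variable scheme of Teh et al. (2006); the sketch below assumes that scheme and toy counts, since the paper's exact sampler is not shown here.

```python
import numpy as np

def resample_concentration(alpha, a, b, n_j, m, rng):
    """One auxiliary-variable update for a DP concentration parameter
    with a Gamma(a, b) hyperprior (Teh et al., 2006). n_j: customers
    per group; m: total number of tables."""
    w = rng.beta(alpha + 1.0, n_j)                   # auxiliary w_j
    s = rng.random(len(n_j)) < n_j / (n_j + alpha)   # auxiliary s_j
    return rng.gamma(a + m - s.sum(), 1.0 / (b - np.log(w).sum()))

rng = np.random.default_rng(0)
# (a, b) = (2, 1) for alpha0, as in the setting above.
alpha0 = resample_concentration(alpha=1.0, a=2.0, b=1.0,
                                n_j=np.array([30.0, 25.0, 45.0]),
                                m=8, rng=rng)
print(alpha0)
```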
Experiment | The development test data is used to set the hyperparameters, i.e., to terminate the tuning iterations.
Hierarchical Phrase Table Combination | All the parameters θ_j and hyperparameters d_j and s_j are obtained by learning on the j-th domain.
Hierarchical Phrase Table Combination | Retuning the hyperparameters when cascading another domain might improve the performance of the combination weights, but we leave this for future work.
Phrase Pair Extraction with Unsupervised Phrasal ITGs | 3. d and s are the discount and strength hyperparameters.
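A discount–strength pair (d, s) is most familiar from the Pitman-Yor Chinese-restaurant predictive probability; the sketch below shows that standard form (an assumption about how d and s enter the model here, with toy counts).

```python
def pitman_yor_predictive(c_w, t_w, c, t, d, s, p_base):
    """Chinese-restaurant predictive probability under a Pitman-Yor
    process with discount d and strength s. c_w/t_w: count and table
    count for this phrase pair; c/t: totals; p_base: base-measure
    probability of the phrase pair."""
    return (max(c_w - d * t_w, 0.0) + (s + d * t) * p_base) / (c + s)

print(pitman_yor_predictive(c_w=5, t_w=2, c=100, t=30,
                            d=0.5, s=1.0, p_base=1e-4))
```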
Related Work | However, their methods usually require a number of hyperparameters, such as the mini-batch size, step size, or human judgments to determine the quality of phrases, and they still rely on a heuristic phrase extraction step in each phrase table update.
Models | P: number of personas (hyperparameter); K: number of word topics (hyperparameter); D: number of movie plot summaries; E: number of characters in movie d; W: number of (role, word) tuples used by character e; φ_k: topic k's distribution over V words.
Models | Next, let a persona p be defined as a set of three multinomials ψ_p over these K topics, one for each typed role r, each drawn from a Dirichlet with a role-specific hyperparameter ν_r.
Models | In other words, the probability that character e embodies persona k is proportional to the number of other characters in the plot summary who also embody that persona (plus the Dirichlet hyperparameter α_k) times the contribution of each observed word w_j for that character, given its current topic assignment z_j.
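That update can be sketched as a single collapsed-Gibbs-style draw; the persona count vector and per-persona word likelihoods below are toy placeholders for the quantities just described.

```python
import numpy as np

rng = np.random.default_rng(0)

P = 4
alpha = np.full(P, 0.5)                 # Dirichlet hyperparameter alpha_k
n_doc = np.array([3, 0, 1, 2])          # other characters with persona k
# word_lik[k]: product over the character's words w_j of the probability
# of topic assignment z_j under persona k (toy values).
word_lik = np.array([0.02, 0.10, 0.01, 0.05])

probs = (n_doc + alpha) * word_lik      # unnormalised posterior
probs /= probs.sum()
persona = rng.choice(P, p=probs)        # resampled persona assignment
print(persona, probs.round(3))
```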
Method | μ and λ are two hyperparameters whose values are discussed in Section 5.
Method | Based on the development data, the hyperparameters of our model were tuned over the following settings: for the graph propagation, μ ∈ {0.2, 0.5, 0.8} and λ ∈ {0.1, 0.3, 0.5, 0.8}; for the CRF training, α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.
Method | With the chosen set of hyperparameters, the test data was used to measure the final performance.
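This tuning procedure amounts to a grid search over those value sets on the development data; in the sketch below, dev_score is a hypothetical stand-in for training the model with one setting and scoring it on the development set.

```python
from itertools import product

def dev_score(mu, lam, alpha):
    """Hypothetical stand-in: train with (mu, lambda, alpha) and return
    the development-set score (here a toy function with a known peak)."""
    return -(mu - 0.5) ** 2 - (lam - 0.3) ** 2 - (alpha - 0.7) ** 2

grid = product([0.2, 0.5, 0.8],             # mu (graph propagation)
               [0.1, 0.3, 0.5, 0.8],        # lambda (graph propagation)
               [0.1, 0.3, 0.5, 0.7, 0.9])   # alpha (CRF training)
best = max(grid, key=lambda cfg: dev_score(*cfg))
print("chosen (mu, lambda, alpha):", best)
```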