Gaussian Process Regression | Specifically, we can derive the gradient of the (log) marginal likelihood with respect to the model hyperparameters (i.e., the kernel and noise hyperparameters).
Gaussian Process Regression | Note that in general the marginal likelihood is non-convex in the hyperparameter values, and consequently the solutions may only be locally optimal. |
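For concreteness, here is a minimal sketch of this kind of gradient-based hyperparameter optimisation for GP regression, assuming an RBF kernel with lengthscale, signal variance, and noise variance as the hyperparameters. The names and the use of SciPy (rather than the GPML toolbox mentioned below) are illustrative choices, not the papers' setup; given the non-convexity noted above, multiple restarts are advisable.

```python
# Sketch: GP hyperparameter optimisation by gradient ascent on the
# log marginal likelihood (shown as minimisation of its negative).
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """Negative log marginal likelihood and gradient w.r.t. log-hyperparameters."""
    ell, sf2, sn2 = np.exp(log_theta)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kf = sf2 * np.exp(-0.5 * sq / ell ** 2)          # RBF kernel matrix
    K = Kf + sn2 * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    nll = 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) \
          + 0.5 * len(y) * np.log(2 * np.pi)
    # d(log lik)/d theta_j = 0.5 * tr((alpha alpha^T - K^{-1}) dK/d theta_j)
    Kinv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(y))))
    W = np.outer(alpha, alpha) - Kinv
    dK = [Kf * sq / ell ** 2,            # w.r.t. log lengthscale
          Kf,                            # w.r.t. log signal variance
          sn2 * np.eye(len(y))]          # w.r.t. log noise variance
    grad = np.array([-0.5 * np.sum(W * dKj) for dKj in dK])
    return nll, grad

X, y = np.random.randn(50, 3), np.random.randn(50)   # placeholder data
res = minimize(neg_log_marginal_likelihood, np.zeros(3), args=(X, y), jac=True)
```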
Gaussian Process Regression | Here we bootstrap the learning of complex models with many hyperparameters by initialising them with the learned hyperparameters of simpler models.
Multitask Quality Estimation 4.1 Experimental Setup | GP: All GP models were implemented using the GPML Matlab toolbox. Hyperparameter optimisation was performed using conjugate gradient ascent of the log marginal likelihood function, with up to 100 iterations.
Multitask Quality Estimation 4.1 Experimental Setup | The simpler models were initialised with all hyperparameters set to one, while more complex models were initialised using the learned hyperparameters of the simpler models.
Conclusion | Even though we have used a small set of gold-standard alignments to tune our hyperparameters, we found that performance was fairly robust to variation in the hyperparameters, and translation performance was good even when gold-standard alignments were unavailable.
Experiments | We have implemented our algorithm as an open-source extension to GIZA++. Usage of the extension is identical to standard GIZA++, except that the user can switch the ℓ0 prior on or off, and adjust the hyperparameters α and β.
Experiments | We set the hyperparameters α and β by tuning on gold-standard word alignments (to maximize F1) when possible.
Experiments | The fact that we had to use hand-aligned data to tune the hyperparameters α and β means that our method is no longer completely unsupervised.
Method | The hyperparameter β controls the tightness of the approximation, as illustrated in Figure 1.
Bayesian inference for PCFGs | Input: Grammar G, vector of trees t, vector of hyperparameters α, previous parameters θ_0.
Bayesian inference for PCFGs | Result: A vector of parameters θ. repeat: draw θ from a product of Dirichlets with hyperparameters α + f(t)
Bayesian inference for PCFGs | Input: Grammar G, vector of trees t, vector of hyperparameters α, previous rule parameters θ.
Analysis | Next we examine the transition Dirichlet hyperparameters learned by our model. |
Analysis | As we can see, the learned hyperparameters yield highly asymmetric priors over transition distributions. |
Analysis | Figure 4 shows MAP transition Dirichlet hyperparameters of the CLUST model, when trained |
Experiments | The simplest version, SYMM, disregards all information from other languages, using simple symmetric hyperparameters on the transition and emission Dirichlet priors (all hyperparameters set to 1). |
Inference | The second term is the tag transition predictive distribution given Dirichlet hyperparameters, yielding a familiar Pólya urn scheme form.
Inference | Finally, we tackle the third term, Equation 7, corresponding to the predictive distribution of emission observations given Dirichlet hyperparameters.
Inference | To sample the Dirichlet hyperparameter for cluster k and transition t → t′, we need to compute:
Background | The Dirichlet distribution is parametrized by hyperparameters α_k (> 0).
Background | where C(k) is the frequency of choice k in data D. For example, C(k) = C(w_i, f_k) in the estimation of p(f_k | w_i). This is very simple: we just need to add the observed counts to the hyperparameters.
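As a minimal illustration of this conjugate Dirichlet-multinomial update (the numbers are made up):

```python
# Sketch: the Dirichlet posterior is obtained by adding observed counts
# to the prior hyperparameters.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # prior hyperparameters alpha_k
counts = np.array([5, 0, 2])        # observed counts C(k) in data D
posterior = alpha + counts          # Dirichlet posterior hyperparameters
# Posterior-mean probability of each choice k:
p_k = posterior / posterior.sum()   # -> [0.6, 0.1, 0.3]
```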
Experiments | We randomly chose 200 sets each for sets “A” and “B.” Set “A” is a development set to tune the value of the hyperparameters and |
Experiments | As for BCb, we assumed that all of the hyperparameters had the same value, i.e., α_k = α.
Experiments | Because tuning hyperparameters carries a risk of overfitting, the robustness of the tuned values should be assessed.
Method | Note that with the Dirichlet prior, α′_k = α_k + C(w_1, f_k) and β′_k = β_k + C(w_2, f_k), where α_k and β_k are the hyperparameters of the priors of w_1 and w_2, respectively.
Method | To put it all together, we can obtain a new Bayesian similarity measure on words, which can be calculated only from the hyperparameters for the Dirichlet prior, α and β, and the observed counts C(w_i, f_k).
Experiments | Hyperparameters are inferred, which leads to a dominant topic that includes mainly light verbs (have, let, see, do). |
Experiments | Each condition (model, vowel speakers, consonant set) is run five times, using 1500 iterations of Gibbs sampling with hyperparameter sampling. |
Inference: Gibbs Sampling | Square nodes depict hyperparameters.
Inference: Gibbs Sampling | Λ is the set of hyperparameters used by H_L when generating lexical items (see Section 3.2).
Inference: Gibbs Sampling | 5.3 Hyperparameters |
Experimental Setup | Table 1: The values of the hyperparameters of our model, where μ_d and λ_d are the d-th entry of the mean and the diagonal of the inverse covariance matrix of the training data.
Experimental Setup | Hyperparameters and Training Iterations: The values of the hyperparameters of our model are shown in Table 1, where μ_d and λ_d are the d-th entry of the mean and the diagonal of the inverse covariance matrix computed from the training data.
Inference | We use P(· | ···) to denote a conditional posterior probability given observed data, all the other variables, and the hyperparameters of the model.
Inference | The conjugate prior we use for the two variables is a normal-Gamma distribution with hyperparameters μ_0, κ_0, α_0 and β_0 (Murphy, 2007).
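For reference, a minimal sketch of the conjugate normal-Gamma posterior update, following the standard formulas collected in Murphy (2007); the function name and inputs are illustrative, not the paper's code:

```python
# Sketch: posterior hyperparameters of a normal-Gamma prior on the
# mean and precision of a Gaussian, given observations x.
import numpy as np

def normal_gamma_posterior(x, mu0, kappa0, alpha0, beta0):
    n, xbar = len(x), np.mean(x)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, alpha_n, beta_n
```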
Inference | Assume we use a symmetric Dirichlet distribution with a positive hyperparameter |
Model | Figure 2 shows the graphical model, where the shaded circle denotes the observed feature vectors, and the squares denote the hyperparameters of the priors used in our model.
Results | In the future, we plan to extend the model and infer the values of these hyperparameters directly from the data.
Experiments | Table 6 shows examples of zero and nonzero topics for the dev.-tuned hyperparameter values. |
Group Lasso | where λ_glas is a hyperparameter tuned on development data, and λ_g is a group-specific weight.
Notation | Both methods disprefer weights of large magnitude; smaller (relative) magnitude means a feature (here, a word) has a smaller effect on the prediction, and zero means a feature has no effect. The hyperparameter λ in each case is typically tuned on a development dataset.
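A minimal sketch of this dev-set tuning loop, using scikit-learn's logistic regression as a stand-in classifier (its C parameter is the inverse of λ); the dataset variables are placeholders:

```python
# Sketch: choose the regularization strength lambda that maximizes
# accuracy on a held-out development set.
from sklearn.linear_model import LogisticRegression

def tune_lambda(X_train, y_train, X_dev, y_dev, lambdas=(0.01, 0.1, 1.0, 10.0)):
    best_lam, best_acc = None, -1.0
    for lam in lambdas:
        model = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
        model.fit(X_train, y_train)
        acc = model.score(X_dev, y_dev)   # dev-set accuracy
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam
```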
Structured Regularizers for Text | As a result, besides λ_glas, we have an additional hyperparameter, denoted by λ_las.
Structured Regularizers for Text | Since the lasso-like penalty does not occur naturally in a non-tree-structured regularizer, we add an additional lasso penalty for each word type (with hyperparameter λ_las) to also encourage weights of irrelevant words to go to zero.
Structured Regularizers for Text | Similar to the parse tree regularizer, for the lasso-like penalty on each word, we tune one group weight for all word types on development data with a hyperparameter λ_las.
Algorithm | scalars γ^(e) and γ^(n) are hyperparameters of the
Algorithm | The generative process of word distributions for non-emotion topics follows the standard LDA definition with a scalar hyperparameter β^(n).
Algorithm | They are generated from Dirichlet priors Dir(α^(e)) and Dir(α^(n)), with α^(e) and α^(n) being hyperparameters.
Experiments | We first fix the implementation details for the EaLDA model, specifying the hyperparameters that we choose for the experiment.
Experiments | We set the topic numbers M = 6 and K = 4, and the hyperparameters α = 0.75, α^(e) = α^(n) = 0.45, and β^(n) = 0.5.
Experiments | The averages of the hyperparameters of PROP were 0.84 ± 0.05 for λ and 0.85 ± 0.10 for the threshold.
Experiments | Proposed Model (PROP): Using the training data, we determined the two hyperparameters, λ and the threshold for rounding φ_rs to 1 or 0, so that they maximized the F value.
Experiments | On the other hand, our model learns parameters such as α_r for each relation, and thus the hyperparameter of our model does not directly affect its performance.
Generative Model | In this section, we consider relation r, since parameters are conditionally independent if relation r and the hyperparameter are given.
Generative Model | λ is the hyperparameter and m_st is constant.
Generative Model | where 0 ≤ λ ≤ 1 is the hyperparameter that controls how strongly b_rs is affected by the main labeling process explained in the previous subsection.
Distributional Semantic Hidden Markov Models | We follow the “neutral” setting of hyperparameters given by Ormoneit and Tresp (1995), so that the MAP estimate for the covariance matrix for (event or slot) state i becomes:
Distributional Semantic Hidden Markov Models | where j indexes all the relevant semantic vectors x_j in the training set, r_ij is the posterior responsibility of state i for vector x_j, and β is the remaining hyperparameter that we tune to adjust the amount of regularization.
Distributional Semantic Hidden Markov Models | We tune the hyperparameters (N_E, N_S, δ, β, k) and the number of EM iterations by twofold cross-validation.
Guided Summarization Slot Induction | We trained a DSHMM separately for each of the five domains with different semantic models, tuning hyperparameters by twofold cross-validation. |
Related Work | Distributions that generate the latent variables and hyperparameters are omitted for clarity. |
Conclusion | Because we are interested in applying our techniques to languages for which no labeled resources are available, we paid particular attention to minimizing the number of free parameters and used the same hyperparameters for all language pairs.
Experiments and Results | We paid particular attention to minimizing the number of free parameters, and used the same hyperparameters for all language pairs, rather than attempting language-specific tuning.
Experiments and Results | While we tried to minimize the number of free parameters in our model, there are a few hyperparameters that need to be set. |
Experiments and Results | Fortunately, performance was stable across various values, and we were able to use the same hyperparameters for all languages. |
PCS Projection | where q_i (i = 1, …, |V_f|) are the label distributions over the foreign language vertices and μ and ν are hyperparameters that we discuss in §6.4.
Constraints Shape Topics | In this model, α, β, and η are Dirichlet hyperparameters set by the user; their role is explained below.
Constraints Shape Topics | where T_{d,k} is the number of times topic k is used in document d, P_{k,w_{d,n}} is the number of times the type w_{d,n} is assigned to topic k, α and β are the hyperparameters of the two Dirichlet distributions, and B is the number of top-level branches (this is the vocabulary size for vanilla LDA).
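These counts are exactly those needed for a collapsed Gibbs update. The sketch below shows the vanilla-LDA form of the sampling distribution (the constrained variant adds the η-weighted constraint distributions); array names follow the notation above, and the implementation details are assumptions for illustration:

```python
# Sketch: collapsed Gibbs sampling distribution for LDA.
# T is the D x K doc-topic count matrix and P the K x V topic-word count
# matrix, both with the current token's assignment removed.
import numpy as np

def topic_posterior(d, w, T, P, alpha, beta, B):
    """Unnormalized p(z = k | rest) for word type w in document d."""
    return (T[d] + alpha) * (P[:, w] + beta) / (P.sum(axis=1) + B * beta)

def sample_topic(d, w, T, P, alpha, beta, B, rng=np.random.default_rng()):
    p = topic_posterior(d, w, T, P, alpha, beta, B)
    return rng.choice(len(p), p=p / p.sum())
```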
Constraints Shape Topics | In order to make the constraints effective, we set the constraint word-distribution hyperparameter η to be much larger than the hyperparameter for the distribution over constraints and vocabulary, β.
Simulation Experiment | The hyperparameters for all experiments are α = 0.1, β = 0.01, and η = 100.
Bilingual Infinite Tree Model | Our procedure alternates between sampling each of the following variables: the auxiliary variables u, the state assignments z, the transition probabilities π, the shared DP parameters β, and the hyperparameters α_0 and γ.
Bilingual Infinite Tree Model | α_0 is parameterized by a gamma hyperprior with hyperparameters α_a and α_b.
Bilingual Infinite Tree Model | γ is parameterized by a gamma hyperprior with hyperparameters γ_a and γ_b.
Experiment | In sampling α_0 and γ, the hyperparameters α_a, α_b, γ_a, and γ_b are set to 2, 1, 1, and 1, respectively, which is the same setting as in Van Gael et al.
Experiment | The development test data is used to set the hyperparameters, i.e., to decide when to terminate the tuning iterations.
Supervised evaluation tasks | After choosing hyperparameters to maximize the dev F1, we would retrain the model using these hyperparameters on the full 8,936-sentence training set, and evaluate on test.
Supervised evaluation tasks | One hyperparameter was the ℓ2-regularization sigma, which for most models was optimal at 2 or 3.2.
Supervised evaluation tasks | The word embeddings also required a scaling hyperparameter, as described in Section 7.2.
Unlabeled Data | We can scale the embeddings by a hyperparameter to control their standard deviation.
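A minimal sketch of this scaling (names and values are illustrative):

```python
# Sketch: rescale pretrained embeddings so their per-entry standard
# deviation matches a tunable target.
import numpy as np

def scale_embeddings(E, target_std):
    """Scale embedding matrix E so that its entries have std target_std."""
    return E * (target_std / E.std())

E = np.random.randn(10000, 50)         # placeholder pretrained embeddings
E_scaled = scale_embeddings(E, 0.1)    # the target std is the tuned hyperparameter
```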
Experiments | Special emphasis is put on corpus construction, determination of upper bounds and baselines, and a sensitivity analysis of important hyperparameters.
Experiments | SGD receives two hyperparameters as input: the number of iterations T, and the regularization parameter λ.
Experiments | Recall that CL-SCL receives three hyperparameters as input: the number of pivots m, the dimensionality of the cross-lingual representation k, |
Introduction | Third, an in-depth analysis with respect to important hyperparameters such as the ratio of labeled and unlabeled documents, the number of pivots, and the optimum dimensionality of the cross-lingual representation. |
Experiments | Since our implementation is based on Unicode and learns all hyperparameters from the data, we also confirmed that NPYLM segments the Arabic Gigaword corpus equally well.
Inference | Sample hyperparameters of Θ
Inference | We place a Gamma prior Ga(λ; a, b) on λ to estimate it from the data for a given language and word type. Here, Γ(x) is the Gamma function and a, b are the hyperparameters, chosen to give a nearly uniform prior distribution.
Pitman-Yor process and n-gram models | θ, d are hyperparameters that can be learned as Gamma and Beta posteriors, respectively, given the data.
Hierarchical Phrase Table Combination | All the parameters θ_j and hyperparameters d_j and s_j are obtained by learning on the j-th domain.
Hierarchical Phrase Table Combination | Retuning the hyperparameters when cascading another domain may improve the performance of the combination weight, but we leave this for future work.
Phrase Pair Extraction with Unsupervised Phrasal ITGs | 3. d and s are the discount and strength hyperparameters.
Related Work | However, their methods usually require a number of hyperparameters, such as mini-batch size, step size, or human judgment to determine the quality of phrases, and still rely on a heuristic phrase extraction method in each phrase table update.
Experiments | There are four hyperparameters in our model to be tuned using the development data (devMT), among the following settings: for the graph propagation, μ ∈ {0.2, 0.5, 0.8} and ρ ∈ {0.1, 0.3, 0.5, 0.8}; for the PR learning, α ∈ {0 ≤ α ≤ 1} and η ∈ {0 ≤ η ≤ 1}, where the step is 0.1.
Experiments | The optimal hyperparameter values were found to be: STS-NO-GP (α = 0.8 and η = 0.6) and STS-GP-PL (μ = 0.5, ρ = 0.3, α = 0.8 and η = 0.6).
Experiments | The optimal hyperparameter values were found to be: VES-NO-GP (α = 0.7) and VES-GP-PL (μ = 0.5, ρ = 0.3 and α = 0.7).
Methodology | The hyperparameter λ controls the impact of the penalty term.
Experiments | We develop our features and tune their hyperparameter values on the ACE04 development set and then use these on the ACE04 test set. On the ACE05 and ACE05-ALL datasets, we directly transfer our Web features and their hyperparameter values from the ACE04 dev-set, without any retuning.
Semantics via Web Features | To capture this effect, we create a feature that indicates whether there is a match in the top k seeds of the two headwords (where k is a hyperparameter to tune).
Semantics via Web Features | We first collect the POS tags (using length-2 character prefixes to indicate coarse parts of speech) of the seeds matched in the top k′ seed lists of the two headwords, where k′ is another hyperparameter to tune.
Semantics via Web Features | We tune a separate bin-size hyperparameter for each of these three features. |
Evaluation | All hyperparameters α_c, β_c were held constant at α, β for simplicity and were fit using grid search over α ∈ [10^-6, 10^6], β ∈ [10^-3, 0.5].
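A minimal sketch of such a grid search, assuming log-spaced grids over the stated ranges; `evaluate` is a placeholder for the held-out objective, not a function from the paper:

```python
# Sketch: exhaustive grid search over log-spaced hyperparameter grids.
import itertools
import numpy as np

alphas = np.logspace(-6, 6, num=13)              # alpha in [1e-6, 1e6]
betas = np.logspace(-3, np.log10(0.5), num=10)   # beta in [1e-3, 0.5]

def grid_search(evaluate):
    best = max(itertools.product(alphas, betas),
               key=lambda ab: evaluate(alpha=ab[0], beta=ab[1]))
    return best   # the (alpha, beta) pair with the highest score
```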
Evaluation | Hyperparameters were handled the same way as for GS. |
The STSG Model | The hyperparameters α_c can be incorporated into the generative model as random variables; however, we opt to fix these at various constants to investigate different levels of sparsity.
The STSG Model | Assuming fixed hyperparameters α = {α_c} and β = {β_c}, our inference problem is to find the posterior distribution of the derivation sequences
Experimental setup | Unless stated otherwise, all results are based on runs of 1,000 iterations with 100 classes, with a 200-iteration burn-in period after which hyperparameters were reestimated every 50 iterations. The probabilities estimated by the models (P(n|v, r) for LDA and P(n, v|r) for ROOTH- and DUAL-LDA) were sampled every 50 iterations post-burn-in and averaged over three runs to smooth out variance.
Results | (2009) demonstrate that LDA is relatively insensitive to the choice of topic vocabulary size Z when the α and β hyperparameters are optimised appropriately during estimation.
Results | In fact, we do not find that performance becomes significantly less robust when hyperparameter reestimation is deactivated; correlation scores simply drop by a small amount (1-2 points), irrespective of the Z chosen.
Three selectional preference models | Given a dataset of predicate-argument combinations and values for the hyperparameters α and β, the probability model is determined by the class assignment counts f_zn and f_zv.
Conclusion and future work | In this paper all of the hyperparameters α_A were tied and varied simultaneously, but it is desirable to learn these from data as well.
Conclusion and future work | Just before the camera-ready version of this paper was due, we developed a method for estimating the hyperparameters by putting a vague Gamma hyperprior on each α_A and sampling using Metropolis-Hastings with a sequence of increasingly narrow Gamma proposal distributions, producing results for each model that are as good as or better than the best ones reported in Table 1.
Word segmentation with adaptor grammars | We tied the Dirichlet Process concentration parameters α, and performed runs with α = 1, 10, 100, and 1000; apart from this, no attempt was made to optimize the hyperparameters.
Word segmentation with adaptor grammars | It may be possible to correct this by “tuning” the grammar’s hyperparameters, but we did not attempt this here.
Models | P: number of personas (hyperparameter); K: number of word topics (hyperparameter); D: number of movie plot summaries; E: number of characters in movie d; W: number of (role, word) tuples used by character e; φ_k: topic k’s distribution over V words.
Models | Next, let a persona p be defined as a set of three multinomials ψ_p over these K topics, one for each typed role r, each drawn from a Dirichlet with a role-specific hyperparameter (ν_r).
Models | In other words, the probability that character e embodies persona k is proportional to the number of other characters in the plot summary who also embody that persona (plus the Dirichlet hyperparameter α_k) times the contribution of each observed word w_j for that character, given its current topic assignment z_j.
Inference | 4.3 Hyperparameter Estimation |
Inference | We treat the hyperparameters {d, θ} as random variables and update their values in every MCMC iteration.
Inference | We place a prior on the hyperparameters as follows: d ~ Beta(1, 1), θ ~ Gamma(1, 1).
Evaluating Topic Shift Tendency | 2008 Elections: To obtain a posterior estimate of π (Figure 3), we created 10 chains with hyperparameters sampled from the uniform distribution U(0, 1) and averaged π over the 10 chains (as described in Section 5).
Inference | Marginal counts are represented with · and * represents all hyperparameters.
Topic Segmentation Experiments | Initial hyperparameter values are sampled from U(0, 1) to favor sparsity; statistics are collected after 500 burn-in iterations with a lag of 25 iterations over a total of 5000 iterations; and slice sampling (Neal, 2003) optimizes hyperparameters.
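For reference, a minimal sketch of univariate slice sampling in the stepping-out/shrinkage form of Neal (2003), as commonly used to resample a single hyperparameter between Gibbs iterations; `log_post` is a placeholder for the hyperparameter's log posterior (it should return -inf outside the support):

```python
# Sketch: one univariate slice-sampling update for a hyperparameter x0.
import math
import random

def slice_sample(x0, log_post, w=1.0, max_steps=100):
    log_y = log_post(x0) + math.log(random.random())   # auxiliary slice level
    # Step out to find an interval [left, right] that brackets the slice.
    left = x0 - w * random.random()
    right = left + w
    while log_post(left) > log_y and max_steps > 0:
        left -= w; max_steps -= 1
    while log_post(right) > log_y and max_steps > 0:
        right += w; max_steps -= 1
    # Shrink the interval until a point inside the slice is drawn.
    while True:
        x1 = random.uniform(left, right)
        if log_post(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```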
Experimental Setup | Training Regimes and Hyperparameters For each run of our model we perform three random restarts to convergence and select the posterior with lowest final free energy. |
Experimental Setup | Dirichlet hyperparameters are set to 0.1. |
Model | Fixed hyperparameters are subscripted with zero. |
The PYP-HMM | The arrangement of customers at tables defines a clustering which exhibits a power-law behavior controlled by the hyperparameters a and b. |
The PYP-HMM | Sampling hyperparameters: We treat the hyperparameters {(a^x, b^x) : x ∈ (U, B, T, E, O)} as random variables in our model and infer their values.
The PYP-HMM | The result of this hyperparameter inference is that there are no user-tunable parameters in the model, an important feature that we believe helps explain its consistently high performance across test settings.
Method | μ and λ are two hyperparameters whose values are discussed in Section 5.
Method | Based on the development data, the hyperparameters of our model were tuned among the following settings: for the graph propagation, μ ∈ {0.2, 0.5, 0.8} and λ ∈ {0.1, 0.3, 0.5, 0.8}; for the CRF training, α ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.
Method | With the chosen set of hyperparameters, the test data was used to measure the final performance.
Model | The generative story runs as follows (Figure 2 depicts the full graphical model): Let there be M unique authors in the data, P latent personas (a hyperparameter to be set), and V words in the vocabulary (in the general setting these may be word types; in our data the vocabulary is the set of 1,000 unique cluster IDs). |
Model | This is proportional to the number of other characters in document d who also (currently) have that persona (plus the Dirichlet hyperparameter, which acts as a smoother) times the probability (under p_{d,c} = z) of all of the words
Model | P: number of personas (hyperparameter); D: number of documents; C_d: number of characters in document d; W_{d,c}: number of (cluster, role) tuples for character c; m_d: metadata for document d (ranges over M authors); θ_d: document d’s distribution over personas; p_{d,c}: character c’s persona; j: an index for a ⟨r, w⟩ tuple in the data; w_j: word cluster ID for tuple j; r_j: role for tuple j ∈ {agent, patient, poss, pred}; η: coefficients for the log-linear language model; μ, λ: Laplace mean and scale (for regularizing η); α: Dirichlet concentration parameter.
Inference | Recall that ν_e is a hyperparameter for the Dirichlet prior on G_0 and depends on the value of the corresponding indicator variable λ_e.
Inference | Recall that each sparsity indicator λ_e determines the value of the corresponding hyperparameter ν_e of the Dirichlet prior for the character-edit base distribution G_0.
Model | The prior on the base distribution G_0 is a Dirichlet distribution with hyperparameters ν, i.e., g_0 ~ Dirichlet(ν).
Code-Switching | We use asymmetric Dirichlet priors (Wallach et al., 2009), and let the optimization process learn the hyperparameters . |
Code-Switching | We optimize the hyperparameters α, β, γ, and δ by interleaving sampling iterations with a Newton-Raphson update to obtain the MLE estimate of the hyperparameters.
Code-Switching | where H is the Hessian matrix and ∂L/∂α is the gradient of the likelihood function with respect to the hyperparameter being optimized.
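A minimal sketch of such a Newton-Raphson update, shown here for the MLE of a symmetric Dirichlet hyperparameter α fit to observed probability vectors; this generic objective is an assumption for illustration, not the paper's exact likelihood:

```python
# Sketch: Newton-Raphson MLE of a symmetric Dirichlet concentration alpha.
# P is an N x K matrix whose rows are observed probability vectors; the
# gradient and Hessian below follow from the Dirichlet log-likelihood.
import numpy as np
from scipy.special import digamma, polygamma

def fit_symmetric_dirichlet(P, alpha=1.0, iters=50, tol=1e-8):
    N, K = P.shape
    S = np.log(P).sum()                           # sufficient statistic
    for _ in range(iters):
        grad = N * K * (digamma(K * alpha) - digamma(alpha)) + S
        hess = N * K * (K * polygamma(1, K * alpha) - polygamma(1, alpha))
        alpha_new = max(alpha - grad / hess, 1e-10)   # Newton step, kept positive
        if abs(alpha_new - alpha) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha
```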
Hierarchical Topic Models 3.1 Latent Dirichlet Allocation | where α and η are hyperparameters smoothing the per-attribute-set distribution over concepts and the per-concept attribute distribution, respectively (see Figure 2 for the graphical model).
Hierarchical Topic Models 3.1 Latent Dirichlet Allocation | The hyperparameter γ controls the probability of branching via the per-node Dirichlet Process, and L is the fixed tree depth.
Hierarchical Topic Models 3.1 Latent Dirichlet Allocation | Hyperparameters were α = 0.1, η = 0.1, γ = 1.0.