Inference | For each of the remaining feature vectors in …
Inference | Mixture ID (m_t) For each feature vector in a segment, given the cluster label c and the hidden state index s_t, the derivation of the conditional posterior probability of its mixture ID is straightforward:
Inference | where m_{c,s} is the set of mixture IDs of feature vectors that belong to state s of HMM c. The m-th entry of β′ is β + Σ_{m_t ∈ m_{c,s}} δ(m_t, m), where we use δ(·,·) to denote the Kronecker delta function.
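Concretely, this describes a standard Gibbs step: the conditional posterior over mixture IDs is proportional to a Dirichlet-smoothed count of how often each mixture is currently used in state s of HMM c, times the Gaussian likelihood of x_t under that mixture. Below is a minimal numpy/scipy sketch of such a step; the function name, data layout, and exact prior/likelihood forms are assumptions for illustration, not the paper's precise update.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_mixture_id(x_t, mu, cov, counts, beta, rng):
    """Sample a mixture ID for feature vector x_t given the current
    assignments in its (cluster, state) cell.

    counts[m] is the number of other feature vectors in state s of HMM c
    currently assigned to mixture m; beta is the symmetric Dirichlet
    hyperparameter on the mixture weights.
    """
    # Prior term: Dirichlet-smoothed usage counts of each mixture component.
    prior = counts + beta
    # Likelihood term: Gaussian density of x_t under each mixture component.
    lik = np.array([multivariate_normal.pdf(x_t, mean=mu[m], cov=cov[m])
                    for m in range(len(counts))])
    post = prior * lik
    post /= post.sum()
    return rng.choice(len(counts), p=post)

# Toy usage with 3 mixture components in a 2-D feature space.
rng = np.random.default_rng(0)
mu = [np.zeros(2), np.ones(2), -np.ones(2)]
cov = [np.eye(2)] * 3
counts = np.array([4.0, 1.0, 0.0])
m_t = sample_mixture_id(np.array([0.9, 1.1]), mu, cov, counts, beta=0.5, rng=rng)
```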
Model | Given the cluster label, choose a hidden state for each feature vector x_t in the segment.
Model | Use the chosen Gaussian mixture to generate the observed feature vector x_t.
Model | Fig. 2, where the shaded circle denotes the observed feature vectors, and the squares denote the hyperparameters of the priors used in our model.
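The generative steps listed above (pick a hidden state per frame from the cluster's HMM, pick a Gaussian mixture within that state, then emit the feature vector) can be sketched as follows; the parameterization and the 13-dimensional features are placeholders, not the paper's actual priors or acoustic front end.

```python
import numpy as np

def generate_segment(length, pi, trans, weights, mu, cov, rng):
    """Sketch of the generative story for one segment, given a cluster label:
    walk the cluster's HMM to pick a hidden state for each frame, pick a
    Gaussian mixture component within that state, then emit the feature
    vector from that Gaussian."""
    feats, states, mixtures = [], [], []
    s = rng.choice(len(pi), p=pi)                        # initial hidden state
    for _ in range(length):
        m = rng.choice(weights.shape[1], p=weights[s])   # mixture ID within state s
        x = rng.multivariate_normal(mu[s][m], cov[s][m]) # observed feature vector
        states.append(s); mixtures.append(m); feats.append(x)
        s = rng.choice(len(pi), p=trans[s])              # transition to the next state
    return np.array(feats), states, mixtures

# Toy 3-state HMM with 2 Gaussian mixtures per state and 13-D features.
rng = np.random.default_rng(0)
pi = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
weights = np.full((3, 2), 0.5)
mu = np.zeros((3, 2, 13))
cov = np.tile(np.eye(13), (3, 2, 1, 1))
X, s_seq, m_seq = generate_segment(20, pi, trans, weights, mu, cov, rng)
```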
Problem Formulation | Fig. 1 illustrates how the speech signal of a single-word utterance banana is converted to a sequence of feature vectors x_1 to x_n.
Problem Formulation | Segment (p_{j,k}) We define a segment to be composed of feature vectors between two boundary frames.
Problem Formulation | Hidden State (s_t) Since we assume the observed data are generated by HMMs, each feature vector, x_t, has an associated hidden state index.
Distributional semantic models | From every image in a dataset, relevant areas are identified and a low-level feature vector (called a “descriptor”) is built to represent each area. |
Distributional semantic models | Now, given a new image, the nearest visual word is identified for each descriptor extracted from it, such that the image can be represented as a BoVW feature vector, by counting the instances of each visual word in the image (note that an occurrence of a low-level descriptor vector in an image, after mapping to the nearest cluster, will increment the count of a single dimension of the higher-level BoVW vector).
Distributional semantic models | We extract descriptor features of two types. First, the standard Scale-Invariant Feature Transform (SIFT) feature vectors (Lowe, 1999; Lowe, 2004), good at characterizing parts of objects.
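The BoVW construction described here is easy to make concrete: cluster the training descriptors into K visual words, then represent each image by counting, per descriptor, its nearest word. A small scikit-learn sketch follows; random vectors stand in for real SIFT descriptors, and K = 50 is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Build the visual-word codebook: cluster all low-level descriptors from the
# training images into K "visual words" (random 128-D vectors stand in for
# real SIFT descriptors, which would come from an image library).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))
K = 50
codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(train_descriptors)

def bovw_vector(image_descriptors, codebook, K):
    """Map each descriptor to its nearest visual word and count occurrences,
    yielding the image's bag-of-visual-words feature vector."""
    words = codebook.predict(image_descriptors)
    return np.bincount(words, minlength=K)

# Represent a new image by its K-dimensional BoVW count vector.
new_image_descriptors = rng.normal(size=(300, 128))
bovw = bovw_vector(new_image_descriptors, codebook, K)
```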
Inferring a learning curve from mostly monolingual data | The feature vector φ consists of the following features:
Inferring a learning curve from mostly monolingual data | We construct the design matrix Φ with one column for each feature vector φ_{c,t} corresponding to each combination of training configuration c and test set t.
Inferring a learning curve from mostly monolingual data | For a new unseen configuration with feature vector φ_u, we determine the parameters θ_u of the corresponding learning curve as:
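One way to realize this step is a regularized multi-output regression from configuration features to fitted curve parameters. The sketch below uses ridge regression and random placeholder data; it is one plausible instantiation, not necessarily the estimator used in the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Each observed training configuration c (paired with a test set t) has a
# feature vector phi_ct and a fitted learning-curve parameter vector
# theta_ct (e.g. the parameters of a curve fitted to its BLEU scores).
Phi = np.random.default_rng(0).normal(size=(40, 6))    # 40 configurations, 6 features each
Theta = np.random.default_rng(1).normal(size=(40, 3))  # 3 curve parameters per configuration

# Learn a mapping from configuration features to curve parameters.
model = Ridge(alpha=1.0).fit(Phi, Theta)

# For a new, unseen configuration with feature vector phi_u, predict the
# parameters theta_u of its learning curve.
phi_u = np.random.default_rng(2).normal(size=(1, 6))
theta_u = model.predict(phi_u)[0]
```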
Experiments | Algorithms 2 and 3 were infeasible to run on Europarl data beyond one epoch because feature vectors grew too large to be kept in memory.
Introduction | The simple but effective idea is to randomly divide training data into evenly sized shards, use stochastic learning on each shard in parallel, while performing ℓ1/ℓ2 regularization for joint feature selection on the shards after each epoch, before starting a new epoch with a reduced feature vector averaged across shards.
Joint Feature Selection in Distributed Stochastic Learning | Let each translation candidate be represented by a feature vector x ∈ R^D, where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference.
Joint Feature Selection in Distributed Stochastic Learning | Parameter mixing by averaging will help to ease the feature sparsity problem; however, keeping feature vectors on the scale of several million features in memory can be prohibitive.
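A compact sketch of the epoch-level step these passages describe, with the ℓ1/ℓ2 penalty realized as hard top-k selection on per-feature ℓ2 norms across shards; the function name and the top-k variant are illustrative choices, and other thresholding schemes fit the same template.

```python
import numpy as np

def l1l2_select_and_mix(shard_weights, k):
    """After an epoch of parallel stochastic updates, perform joint feature
    selection across shards and return the reduced, averaged weight vector.

    shard_weights: array of shape (num_shards, num_features), one weight
    vector per shard. Features are ranked by the l2 norm of their weights
    across shards (the l1/l2 group criterion); all but the top-k are zeroed,
    which shrinks the feature vector carried into the next epoch.
    """
    W = np.asarray(shard_weights)
    group_norms = np.linalg.norm(W, axis=0)      # l2 norm per feature across shards
    keep = np.argsort(group_norms)[::-1][:k]     # indices of the k strongest features
    mask = np.zeros(W.shape[1], dtype=bool)
    mask[keep] = True
    mixed = W.mean(axis=0)                       # parameter mixing by averaging
    mixed[~mask] = 0.0                           # joint feature selection
    return mixed, mask

# Toy usage: 4 shards, 10 features, keep the 3 jointly strongest features.
rng = np.random.default_rng(0)
mixed_w, selected = l1l2_select_and_mix(rng.normal(size=(4, 10)), k=3)
```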
Estimating the Tensor Model | We assume a function ψ that maps outside trees o to feature vectors ψ(o) ∈ R^{d′}.
Estimating the Tensor Model | For example, the feature vector might track the rule directly above the node in question, the word following the node in question, and so on. |
Estimating the Tensor Model | We also assume a function φ that maps inside trees t to feature vectors φ(t) ∈ R^d.
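As an illustration of such feature functions, the sketch below builds sparse indicator vectors for inside and outside trees via feature hashing; the tree encodings, dimensions, and the hashing trick are assumptions made for the example, not the construction used in the paper.

```python
import numpy as np

D_IN, D_OUT = 1024, 1024  # dimensions d and d' for the two feature spaces

def phi(inside_tree):
    """Indicator features for an inside tree t, e.g. the rule at its root.
    inside_tree: tuple with the root rule first, e.g. ('NP -> DT NN', ...)."""
    v = np.zeros(D_IN)
    v[hash(('root_rule', inside_tree[0])) % D_IN] = 1.0
    return v

def psi(outside_tree):
    """Indicator features for an outside tree o: the rule directly above the
    node in question and the word following it (as suggested above).
    outside_tree: dict like {'rule_above': ..., 'next_word': ...}."""
    v = np.zeros(D_OUT)
    v[hash(('rule_above', outside_tree['rule_above'])) % D_OUT] = 1.0
    v[hash(('next_word', outside_tree['next_word'])) % D_OUT] = 1.0
    return v

x = phi(('NP -> DT NN',))
y = psi({'rule_above': 'S -> NP VP', 'next_word': 'saw'})
```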