Background | where h_m is a feature function, λ_m is the associated feature weight, and Z(f) is a normalization constant:
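The log-linear formulation in the excerpt above can be sketched in a few lines; the candidate set, feature values, and weights below are hypothetical, not taken from the paper:

```python
import math

def loglinear_prob(candidates, feats, weights):
    # P(e | f) = exp(sum_m lambda_m * h_m(f, e)) / Z(f),
    # where Z(f) sums the unnormalized scores over all candidates e.
    scores = {e: math.exp(sum(w * h for w, h in zip(weights, feats[e])))
              for e in candidates}
    z = sum(scores.values())  # normalization constant Z(f)
    return {e: s / z for e, s in scores.items()}

# Hypothetical candidates with two feature values each.
feats = {"e1": [1.0, 0.5], "e2": [0.2, 1.0]}
probs = loglinear_prob(["e1", "e2"], feats, weights=[0.7, 0.3])
```

The probabilities sum to one by construction, since Z(f) is exactly the sum of the unnormalized scores.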
Conclusion | As our decoder accounts for multiple derivations, we extend the MERT algorithm to tune feature weights with respect to BLEU score for max-translation decoding. |
Experiments | We found that accounting for all possible derivations in max-translation decoding resulted in a small negative effect on BLEU score (from 30.11 to 29.82), even though the feature weights were tuned with respect to BLEU score. |
Experiments | We concatenate and normalize their feature weights for the joint decoder. |
Extended Minimum Error Rate Training | Minimum error rate training (Och, 2003) is widely used to optimize feature weights for a linear model (Och and Ney, 2002). |
Extended Minimum Error Rate Training | The key idea of MERT is to tune one feature weight at a time to minimize the error rate while keeping the others fixed.
Extended Minimum Error Rate Training | where a is the feature value of the current dimension, λ is the feature weight being tuned, and b is the dot product of the other dimensions.
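Because each hypothesis scores a·λ + b along the tuned dimension, the one-weight-at-a-time search above can be sketched with a simple grid variant (Och's exact method instead enumerates the intersection points of these lines); the n-best data below is hypothetical:

```python
def mert_line_search(nbest_lists, lam_grid):
    # Tune a single feature weight lam while the others stay fixed.
    # Each hypothesis scores a * lam + b, where a is the feature value
    # of the tuned dimension and b is the dot product of the remaining
    # dimensions; 'err' is the hypothesis's error count.
    best_lam, best_err = None, float("inf")
    for lam in lam_grid:
        err = 0
        for hyps in nbest_lists:
            chosen = max(hyps, key=lambda h: h["a"] * lam + h["b"])
            err += chosen["err"]
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# One sentence with two hypothetical hypotheses.
nbest = [[{"a": 1.0, "b": 0.0, "err": 0},
          {"a": -1.0, "b": 0.5, "err": 1}]]
lam, err = mert_line_search(nbest, [-1.0, 0.0, 1.0])
```

Here the zero-error hypothesis only wins for a sufficiently large λ, so the search returns the largest grid point.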
Introduction | As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4).
Introduction | This also makes training the model parameters a challenging problem, since the amount of labeled training data is usually small compared to the size of feature sets: the feature weights cannot be estimated reliably. |
Introduction | Such models require learning individual feature weights directly, so that the number of parameters to be estimated is identical to the size of the feature set. |
Introduction | When millions of features are used but the amount of labeled data is limited, it can be difficult to precisely estimate each feature weight . |
Tensor Space Representation | Most of the learning algorithms for NLP problems are based on vector space models, which represent data as vectors φ ∈ R^n, and try to learn feature weight vectors w ∈ R^n such that a linear model y = w · φ is able to discriminate between, say, good and bad hypotheses.
Tensor Space Representation | The feature weight tensor W can be decomposed as the sum of a sequence of rank-1 component tensors.
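The rank-1 decomposition in the excerpt above can be illustrated for a small order-3 tensor; the component vectors below are hypothetical:

```python
def outer3(u, v, w):
    # Rank-1 order-3 tensor: the outer product u ⊗ v ⊗ w.
    return [[[ui * vj * wk for wk in w] for vj in v] for ui in u]

def add3(A, B):
    # Elementwise sum of two order-3 tensors of the same shape.
    return [[[a + b for a, b in zip(ra, rb)]
             for ra, rb in zip(ma, mb)] for ma, mb in zip(A, B)]

# W reconstructed as the sum of R = 2 rank-1 component tensors.
components = [([1, 0], [1, 1], [2, 0]), ([0, 1], [1, 0], [0, 3])]
W = outer3(*components[0])
for comp in components[1:]:
    W = add3(W, outer3(*comp))
```

Each rank-1 component contributes only n + n + n parameters instead of n^3, which is the motivation given in the surrounding excerpts for tensors over free per-feature weights.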
Tensor Space Representation | Specifically, a vector space model assumes each feature weight to be a “free” parameter, and estimating them reliably could therefore be hard when training data are not sufficient or the feature set is huge. |
Abstract | Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. |
Base Models | Let θ be the feature weights, and f(s, y_i, y_{i-1}) the feature function over adjacent segments y_i and y_{i-1} in sentence s. The log likelihood of a semi-CRF for a single sentence s is given by:
Base Models | Let θ be the vector of feature weights.
Hierarchical Joint Learning | Each model has its own set of parameters (feature weights).
Hierarchical Joint Learning | These have corresponding log-likelihood functions L_p(D_p; θ_p), L_n(D_n; θ_n), and L_j(D_j; θ_j), where the Ds are the training data for each model, and the θs are the model-specific parameter (feature weight) vectors.
Hierarchical Joint Learning | These three models are linked by a hierarchical prior, and their feature weight vectors are all drawn from this prior. |
Introduction | Then, the singly-annotated data can be used to influence the feature weights for the shared features in the joint model. |
Abstract | A linear model is defined over derivations, and minimum error rate training is used to tune feature weights based on a set of question-answer pairs. |
Introduction | Derivations generated during such a translation procedure are modeled by a linear model, and minimum error rate training (MERT) (Och, 2003) is used to tune feature weights based on a set of question-answer pairs. |
Introduction | λ_i denotes the feature weight of
Introduction | According to the above description, our KB-QA method can be decomposed into four tasks: (1) search space generation for H(Q); (2) question translation for transforming question spans into their corresponding formal triples; (3) feature design; and (4) feature weight tuning for {λ_i}.
Abstract | We describe computationally cheap feature weighting techniques and a novel nonlinear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances, significantly improving annotation quality at low cost.
Experiments | Spreading the feature weights reduces the number of data points that must be examined in order to correct the mislabeled instances. |
Feature Weighting Methods | Following this idea, we develop computationally cheap feature weighting techniques to counteract this effect by boosting the weights of discriminative features, so that they are not subdued and instances with such features have a higher chance of being correctly classified.
Feature Weighting Methods | Specifically, we propose a nonlinear distribution spreading algorithm for feature weighting.
Feature Weighting Methods | The model’s ability to discriminate at the feature level can be further enhanced by leveraging the distribution of feature weights across multiple classes, e.g., multiple emotion categories (funny, happy, sad, exciting, boring, etc.).
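A generic sketch of class-distribution-based feature weighting, in the spirit of the excerpts above, boosts features whose counts concentrate in few classes; this is not the paper's exact spreading function, and the counts and feature names below are hypothetical:

```python
def spread_weights(class_counts, power=2.0):
    # Boost features whose counts are concentrated in few classes:
    # a feature seen in only one class gets weight 1.0, while a
    # feature spread uniformly over k classes gets k * (1/k)**power.
    weights = {}
    for feat, counts in class_counts.items():
        total = sum(counts.values())
        probs = [c / total for c in counts.values()]
        weights[feat] = sum(p ** power for p in probs)
    return weights

# Hypothetical per-class counts for two features.
w = spread_weights({"lol": {"funny": 8, "sad": 0},
                    "the": {"funny": 4, "sad": 4}})
```

With power > 1 the mapping is nonlinear, so peaked distributions are emphasized relative to flat ones, which is the discriminative-feature intuition the excerpts describe.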
A Generic Phrase Training Procedure | 15: Discriminatively train feature weights λ_k and threshold τ
Discussions | The parameter controlling the degree of attenuation in BLT is also optimized together with other feature weights.
Experimental Results | These feature weights are tuned on the dev set using the downhill simplex method to achieve optimal translation performance.
Experimental Results | Once we have computed all feature values for all phrase pairs in the training corpus, we discriminatively train the feature weights λ_k and the threshold τ using the downhill simplex method to maximize the BLEU score on the 06dev set.
Experimental Results | Since the translation engine implements a log-linear model, the discriminative training of feature weights in the decoder should be embedded in the whole end-to-end system jointly with the discriminative phrase table training process. |
Ambiguity-aware Ensemble Training | We apply L2-norm regularized SGD training to iteratively learn feature weights w for our CRF-based baseline and semi-supervised parsers. |
Ambiguity-aware Ensemble Training | We follow the implementation in CRFsuite. At each step, the algorithm approximates a gradient with a small subset of the training examples, and then updates the feature weights.
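The minibatch update described above can be sketched as plain L2-regularized SGD; the learning rate, regularizer, and toy data below are hypothetical rather than CRFsuite's defaults:

```python
import random

def sgd_l2(grad_fn, data, dim, lr=0.1, lam=0.01, epochs=50, batch=2, seed=0):
    # L2-regularized SGD: each step approximates the gradient with a
    # small subset of the training examples, then updates the weights.
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)  # shuffles the example list in place
        for i in range(0, len(data), batch):
            g = grad_fn(w, data[i:i + batch])
            for j in range(dim):
                w[j] -= lr * (g[j] + lam * w[j])  # loss gradient + L2 term
    return w

# Toy squared-loss example fitting y = 2x (hypothetical data).
def grad(w, examples):
    return [sum(2 * (w[0] * x - y) * x for x, y in examples) / len(examples)]

w = sgd_l2(grad, [(1.0, 2.0)], dim=1)
```

The L2 term pulls the weight slightly below the unregularized solution, which is the expected effect of the penalty.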
Ambiguity-aware Ensemble Training | Once the feature weights w are learnt, we can |
Supervised Dependency Parsing | w_dep and w_sib are feature weight vectors; the dot product gives the scores contributed by the corresponding subtrees.
Supervised Dependency Parsing | Assuming the feature weights w are known, the probability of a dependency tree d given an input sentence x is defined as:
Supervised Dependency Parsing | The partial derivative with respect to the feature weights w is: |
Add arc <eC,ej> to GC with | In fact, the score of each arc is calculated as a linear combination of feature weights . |
Add arc <eC,ej> to GC with | (2005a; 2005b), we use the Margin Infused Relaxed Algorithm (MIRA) to learn the feature weights based on a training set of documents annotated with dependency structures y_i, where y_i denotes the correct dependency tree for the text T_i.
Add arc <eC,ej> to GC with | Table 5: Top 10 Feature Weights for Coarse-grained Relation Labeling (Eisner Algorithm)
Abstract | Och (2003) proposed using a log-linear model to incorporate multiple features for translation, and proposed a minimum error rate training (MERT) method to train the feature weights to optimize a desirable translation metric. |
Abstract | (2009) improved a syntactic SMT system by adding as many as ten thousand syntactic features, and used the Margin Infused Relaxed Algorithm (MIRA) to train the feature weights.
Abstract | The feature weights are trained on a tuning set with 2010 sentences using MIRA. |
Experiments and Results | The development data used to tune the feature weights of our decoder is the NIST’03 evaluation set, and the test sets are the NIST’05 and NIST’08 evaluation sets.
Features and Training | Therefore, we can alternatively update graph-based consensus features and feature weights in the log-linear model. |
Features and Training | The decoder then adds the new features and retrains all the feature weights by Minimum Error Rate Training (MERT) (Och, 2003). |
Features and Training | The decoder with new feature weights then provides new n-best candidates and their posteriors for constructing another consensus graph, which in turn gives rise to the next round of
Graph-based Translation Consensus | where ψ is the feature vector, λ is the vector of feature weights, and H(f) is the set of translation hypotheses in the search space.
Graph-based Translation Consensus | Before elaborating how the graph model of consensus is constructed for both a decoder and N-best output re-ranking in section 5, we will describe how the consensus features and their feature weights can be trained in a semi-supervised way, in section 4. |
Background | where {h_m(f, e) | m = 1, ..., M} is a set of features, and λ_m is the feature weight corresponding to the m-th feature.
Background | In this work, Minimum Error Rate Training (MERT), proposed by Och (2003), is used to estimate the feature weights λ over a series of training samples.
Background | where h_t(e) is the log-scaled model score of e in the t-th member system, and β_t is the corresponding feature weight.
Abstract | The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes.
Introduction | Unlike SSH that tries to find a sequence of hash functions, S3H fixes the random projection directions and seeks the optimal feature weights from prior knowledge to relocate the objects such that similar objects have similar fingerprints. |
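The fixed-direction idea in the excerpt above can be sketched as a weighted SimHash fingerprint; the dimensions, seed, and inputs below are hypothetical, and the learned weights are simply passed in:

```python
import random

def weighted_simhash(x, weights, n_bits=16, seed=42):
    # Fingerprint sketch: the random projection directions are fixed
    # (same seed on every call); learned feature weights rescale each
    # feature before projection, so objects that are similar after
    # reweighting tend to share fingerprint bits.
    rng = random.Random(seed)
    dim = len(x)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    bits = 0
    for plane in planes:
        proj = sum(wi * xi * pi for wi, xi, pi in zip(weights, x, plane))
        bits = (bits << 1) | (1 if proj >= 0 else 0)
    return bits

fp = weighted_simhash([1.0, 0.0, 1.0, 0.5], [1.0, 1.0, 1.0, 1.0])
```

Because each bit depends only on the sign of a projection, positively rescaling an object leaves its fingerprint unchanged; what the learned weights change is which objects end up on the same side of each fixed hyperplane.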
Semi-Supervised SimHash | where w ∈ R^M is the feature weight vector to be determined and d_l is the l-th column of the matrix D.
The direction is determined by concatenating w L times. | Instead, S3H seeks the optimal feature weights via L-BFGS, which is still efficient even for very high-dimensional data.
The direction is determined by concatenating w L times. | S3H learns the optimal feature weights from prior knowledge to relocate the data such that similar objects have similar fingerprints.
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a translation model m, a language model lm and a vector of feature weights w, the model score of a derivation d is computed by |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | On the other hand, it has different feature weight vectors for individual models (i.e., w and w_T).
A Skeleton-based Approach to MT 2.1 Skeleton Identification | Given a baseline phrase-based system, all we need is to learn the feature weights w and w_T on the development set (with source-language skeleton annotation) and the skeletal language model on the target-language side of the bilingual corpus.
Evaluation | All feature weights were learned using minimum error rate training (Och, 2003). |
Paraphrasing for Web Search | We utilize minimum error rate training (MERT) (Och, 2003) to optimize feature weights of the paraphrasing model according to NDCG. |
Paraphrasing for Web Search | MERT is used to optimize feature weights of our linear-formed paraphrasing model. |
Paraphrasing for Web Search | λ* = arg min_λ Σ_{i=1}^{M} Err(D_i^{Label}, Q_i; λ). The objective of MERT is to find the optimal feature weight vector λ* that minimizes the error criterion Err according to the NDCG scores of the top-1 paraphrase candidates.
Discussion | Feature Weights |
Discussion | Figure 2: Visual and acoustic feature weights.
Discussion | To determine the role played by each of the visual and acoustic features, we compare the feature weights assigned by the learning algorithm, as shown in Figure 2. |
Model | the training epoch (x-axis) and parsing feature weights (in legend). |
Model | We have some parameters to tune: the parsing feature weight, the beam size, and the training epoch.
Model | Figure 2 shows the F1 scores of the proposed model (SegTagDep) on CTB-Sc-l with respect to the training epoch and different parsing feature weights, where “Seg”, “Tag”, and “Dep” respectively denote the F1 scores of word segmentation, POS tagging, and dependency parsing.
Experimental Results | Next we extract phrase pairs, Hiero rules and tree-to-string rules from the original word alignment and the improved word alignment, and tune all the feature weights on the tuning set. |
Integrating Empty Categories in Machine Translation | The feature weights can be tuned on a tuning set in a log-linear model along with other usual features/costs, including language model scores, bi-direction translation probabilities, etc. |
Related Work | First, in addition to the preprocessing of training data and inserting recovered empty categories, we implement sparse features to further boost the performance, and tune the feature weights directly towards maximizing the machine translation metric. |
Evaluation | One of the main approaches adopted by previous systems involves the identification of features that measure writing skill, and then the application of linear or stepwise regression to find optimal feature weights so that the correlation with manually assigned scores is maximised. |
Previous work | Linear regression is used to assign optimal feature weights that maximise the correlation with the examiner’s scores. |
Previous work | Feature weights and/or scores can be fitted to a marking scheme by stepwise or linear regression. |
A Joint Model with Unlabeled Parallel Text | where θ is a real-valued vector of feature weights and ŷ
A Joint Model with Unlabeled Parallel Text | where θ^{L1} and θ^{L2} are the vectors of feature weights for L1 and L2, respectively (for brevity we denote them as θ_1 and θ_2 in the remaining sections).
A Joint Model with Unlabeled Parallel Text | where the first term on the right-hand side is the log likelihood of the labeled data from both D1 and D2; the second is the log likelihood of the unlabeled parallel data U, multiplied by λ_1 ≥ 0, a constant that controls the contribution of the unlabeled data; and λ_2 ≥ 0 is a regularization constant that penalizes model complexity or large feature weights.
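The three-term objective described above can be written down directly; the log-likelihood values, weights, and default constants below are hypothetical:

```python
def joint_objective(ll_labeled, ll_unlabeled, theta, lam1=0.5, lam2=0.1):
    # Labeled log likelihood, plus the unlabeled log likelihood scaled
    # by lam1 >= 0, minus a lam2-weighted L2 penalty on the weights.
    assert lam1 >= 0 and lam2 >= 0
    penalty = sum(t * t for t in theta)
    return ll_labeled + lam1 * ll_unlabeled - lam2 * penalty

val = joint_objective(-10.0, -4.0, [1.0, 2.0])
```

Setting lam1 = 0 recovers the purely supervised objective, while larger lam1 lets the unlabeled parallel data pull the weights.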
Experimental Setup | We learned the feature weights with a linear SVM, using the software SVM-OOPS (Woodsend and Gondzio, 2009). |
Experimental Setup | This tool gave us directly the feature weights as well as support vector values, and it allowed different penalties to be applied to positive and negative misclassifications, enabling us to compensate for the unbalanced data set. |
Experimental Setup | For each phrase, features were extracted and salience scores calculated from the feature weights determined through SVM training. |
Discussion | Table 8: Reordering feature weights.
Discussion | Feature weight analysis: Table 8 shows the syntactic and semantic reordering feature weights . |
Discussion | It shows that the semantic feature weights decrease in the presence of the syntactic features, indicating that the decoder learns to trust semantic features less in the presence of the more accurate syntactic features. |
Experiments on Parsing-Based SMT | The feature weights are tuned on the development set using minimum error rate training.
Experiments on Phrase-Based SMT | Koehn's implementation of minimum error rate training (Och, 2003) is used to tune the feature weights on the development set.
Improving Statistical Bilingual Word Alignment | feature weights, respectively.
Expt. 2: Predicting Pairwise Preferences | Feature Weights.
Expt. 2: Predicting Pairwise Preferences | Finally, we make two observations about feature weights in the RTER model. |
Expt. 2: Predicting Pairwise Preferences | Second, good MT evaluation feature weights are not good weights for RTE. |
Introduction | Feature weighting.
Introduction | Currently, we set all features to a uniform weight w ∈ (0, 1), which controls the relative importance of the feature in the final tree similarity: the larger the feature weight, the more important the feature.
Introduction | feature weighting algorithm which can accurately |
Collaborative Decoding | Let λ_m be the feature weight vector for member decoder d_m; the training procedure proceeds as follows:
Collaborative Decoding | For each decoder d_m, find a new feature weight vector λ_m* which optimizes the specified evaluation criterion L on D using the MERT algorithm, based on the n-best list H_m generated by d_m:
Collaborative Decoding | where T denotes the translations selected by re-ranking the translations in H_m using a new feature weight vector λ