Introduction | With this framework, adaptation to a new domain simply consists of updating a weight vector, and multiple domains can be supported by the same system. |
Introduction | For each sentence that is being decoded, we choose the weight vector that is optimized on the closest cluster, allowing for adaptation even with unlabelled and heterogeneous test data. |
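The per-sentence selection described here can be sketched as follows; the vector representation, centroids, and function names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def select_weights(sentence_vec, centroids, weight_vectors):
    """Pick the weight vector tuned on the cluster whose centroid
    is closest to the sentence's vector representation."""
    dists = [np.linalg.norm(sentence_vec - c) for c in centroids]
    return weight_vectors[int(np.argmin(dists))]

# Toy example: two clusters, each with its own tuned weight vector.
centroids = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
weights = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]
w = select_weights(np.array([0.9, 0.1]), centroids, weights)
```

Because selection needs only the test sentence itself, this works even when the test data is unlabelled and mixes domains.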
Translation Model Architecture | To combine statistics from a vector of n component corpora, we can use a weighted version of equation 1, which adds a weight vector λ of length n (Sennrich, 2012b): |
Translation Model Architecture | Table 1: Illustration of instance weighting with weight vectors for two corpora. |
Translation Model Architecture | In our implementation, the weight vector is set globally, but can be overridden on a per-sentence basis. |
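A minimal sketch of this weighted combination of corpus statistics, assuming per-corpus counts c_i(s,t) and c_i(s) and a weight vector λ with one entry per component corpus (function and argument names are illustrative):

```python
def weighted_prob(counts_joint, counts_src, lam):
    """Weighted maximum-likelihood estimate p(t|s): per-corpus counts
    c_i(s,t) and c_i(s) are scaled by the weight vector lam before
    being pooled into a single relative-frequency estimate."""
    num = sum(l * c for l, c in zip(lam, counts_joint))
    den = sum(l * c for l, c in zip(lam, counts_src))
    return num / den if den > 0 else 0.0

# Two corpora: the in-domain corpus is weighted higher than the
# out-of-domain one, so its statistics dominate the estimate.
p = weighted_prob(counts_joint=[3, 10], counts_src=[4, 100], lam=[2.0, 0.5])
```

Overriding the weight vector per sentence then just means passing a different `lam` when scoring that sentence's phrase pairs.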
Model | The conditional probability of an assignment y, given an input sequence x and the weight vector θ = (θ_1, …, θ_n). |
Model | When performing inference, we wish to select the output sequence with the highest probability, given the input sequence x and the weight vector θ (i.e., MAP inference). |
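MAP inference under a linear model can be illustrated over an enumerable candidate set (a simplification: real systems search rather than enumerate, and all names here are assumed for the example):

```python
import numpy as np

def map_decode(candidates, features, theta):
    """MAP inference for a linear model: return the output sequence y
    maximizing the score theta . f(x, y) over the candidate set."""
    scores = [np.dot(theta, features[y]) for y in candidates]
    return candidates[int(np.argmax(scores))]

# Two candidate outputs with toy feature vectors.
candidates = ["y1", "y2"]
features = {"y1": np.array([1.0, 0.0]), "y2": np.array([0.0, 2.0])}
best = map_decode(candidates, features, np.array([0.5, 0.5]))
```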
Model | A weight vector θ = (θ_1, …, θ_n). |
Paraphrasing for Web Search | λ̂_1^M = arg min_{λ_1^M} { Σ_{i=1}^N Err(D_i^Label, Q_i; λ_1^M, R) } The objective of MERT is to find the optimal feature weight vector λ̂_1^M that minimizes the error criterion Err according to the NDCG scores of top-1 paraphrase candidates. |
Paraphrasing for Web Search | where Q_i* is the best paraphrase candidate according to the paraphrasing model based on the weight vector λ_1^M, and N(D_i^Label, Q_i*, R) is the NDCG score of Q_i* computed on the documents ranked by R for Q_i and the labeled document set D_i^Label of Q_i. |
Paraphrasing for Web Search | How to learn the weight vector {λ_m}_{m=1}^M is a standard learning-to-rank task. |
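The NDCG criterion referenced above can be computed as in this sketch, using one standard exponential-gain formulation (the excerpt does not specify the exact gain and discount, so this is an assumption):

```python
import math

def ndcg(relevances, k):
    """NDCG@k over a ranked list of graded relevance labels:
    DCG of the given ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking with one inversion (0 ranked above 1) scores just below 1.
score = ndcg([3, 2, 0, 1], k=4)
```

An error criterion of the form Err = 1 − NDCG would then be what MERT minimizes over candidate weight vectors.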
Introduction | distance between hypotheses when projected onto the line defined by the weight vector w. |
Learning in SMT | Given an input sentence in the source language x ∈ X, we want to produce a translation y ∈ Y(x) using a linear model parameterized by a weight vector w: |
The Relative Margin Machine in SMT | More formally, the spread is the distance between y+ and the worst candidate (y^w, d^w) ← arg min_{(y,d) ∈ Y(x_i), D(x_i)} s(x_i, y, d), after projecting both onto the line defined by the weight vector w. For each y′, this projection is conveniently given by s(x_i, y′, d); thus the spread is calculated as δs(x_i, y+, y^w). |
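Under a linear score s(x, y, d) = w · f(x, y, d), this spread reduces to a difference of scores (the projection onto the line defined by w differs only by the constant factor ‖w‖). A minimal sketch, with illustrative feature values:

```python
import numpy as np

def spread(w, feats_best, feats_worst):
    """Spread under weight vector w: the difference of linear scores
    s = w . f between the best derivation y+ and the worst candidate
    y^w, i.e. their distance after projecting onto w (up to ||w||)."""
    return np.dot(w, feats_best) - np.dot(w, feats_worst)

w = np.array([1.0, -0.5])
gap = spread(w,
             feats_best=np.array([2.0, 1.0]),
             feats_worst=np.array([0.5, 2.0]))
```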
Adaptive Online MT | A fixed threadpool of workers computes gradients in parallel and sends them to a master thread, which updates a central weight vector. |
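The asynchronous scheme described here can be sketched with a worker pool and a master update loop (a simplified illustration under assumed names; workers may read a slightly stale weight vector, which is characteristic of such schemes):

```python
import queue
import threading
import numpy as np

def async_sgd(examples, grad_fn, dim, n_workers=2, lr=0.1):
    """Workers compute gradients in parallel and send them to a master
    loop, which applies each one to a single central weight vector."""
    w = np.zeros(dim)
    work, grads = queue.Queue(), queue.Queue()
    for ex in examples:
        work.put(ex)

    def worker():
        while True:
            try:
                ex = work.get_nowait()
            except queue.Empty:
                return
            grads.put(grad_fn(w, ex))  # may read a stale w

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for _ in examples:          # master: one update per received gradient
        w -= lr * grads.get()
    for t in threads:
        t.join()
    return w

# Toy objective: squared distance to each example, gradient w - x.
examples = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_out = async_sgd(examples, lambda w, x: w - x, dim=2)
```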
Adaptive Online MT | During a tuning run, the online method decodes the tuning set under many more weight vectors than a MERT-style batch method. |
Experiments | Our algorithm decodes each example with a new weight vector, thus exploring more of the search space for the same tuning set. |