A multitask transfer learning solution | Let w_k denote the weight vector of the linear classifier that separates positive instances of auxiliary type A_k from negative instances, and let w_T denote a similar weight vector for the target type T.
A multitask transfer learning solution | If different relation types are totally unrelated, these weight vectors should also be independent of each other. |
A multitask transfer learning solution | But because we observe similar syntactic structures across different relation types, we now assume that these weight vectors are related through a common component v.
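One standard way to encode this relatedness (a sketch of the usual shared-component decomposition, not necessarily the paper's exact parameterization) is to write each type-specific weight vector as a shared part plus a type-specific part:

    w_k = v + u_k   for each auxiliary type A_k,        w_T = v + u_T   for the target type,

so that the common component v captures the syntactic structure shared across relation types, while the u vectors capture what is specific to each type.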
Abstract | The proposed framework models the commonality among different relation types through a shared weight vector, enables knowledge learned from the auxiliary relation types to be transferred to the target relation type, and allows easy control of the tradeoff between precision and recall.
Conclusions and future work | In the multitask learning framework that we introduced, different relation types are treated as different but related tasks that are learned together, with the common structures among the relation types modeled by a shared weight vector.
Experiments | the number of nonzero entries in the shared weight vector V. To see how the performance may vary as H changes, we plot the performance of TL-comb and TL-auto in terms of the average F1 across the seven target relation types, with H ranging from 100 to 50000.
Introducing Nonlocal Features | We now outline three different ways of learning the weight vector w with nonlocal features.
Introducing Nonlocal Features | In other words, it is unlikely that we can devise a feature set informative enough to allow the weight vector to converge towards a solution that would let the learning algorithm see the entire documents during training, at least when no external knowledge sources are used.
Representation and Learning | The score of an arc (a_i, m_i) is defined as the scalar product between a weight vector w and a feature vector Φ(a_i, m_i), where Φ is a feature extraction function over an arc (thus extracting features from the antecedent and the anaphor).
Representation and Learning | We find the weight vector w by online learning using a variant of the structured perceptron (Collins, 2002).
Representation and Learning | For each instance it uses the current weight vector w to make a prediction ŷ_i given the input x_i.
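For illustration, a minimal structured-perceptron-style sketch of this arc-scoring setup (the feature vectors and candidate sets here are hypothetical toy data, not the authors' implementation):

```python
import numpy as np

def score(w, phi):
    """Score of an arc: scalar product of the weight vector and the arc's features."""
    return np.dot(w, phi)

def perceptron_epoch(w, instances, lr=1.0):
    """One online pass: pick the best-scoring candidate arc, update on mistakes.

    `instances` is a list of (candidate_features, gold_index) pairs, where
    candidate_features holds one feature vector per possible antecedent arc.
    """
    for candidates, gold in instances:
        pred = max(range(len(candidates)), key=lambda i: score(w, candidates[i]))
        if pred != gold:
            # Move the weights toward the gold arc and away from the prediction.
            w = w + lr * (candidates[gold] - candidates[pred])
    return w

# Toy usage: 20 anaphors, 3 candidate arcs each, 4 features per arc.
rng = np.random.default_rng(0)
data = [(rng.normal(size=(3, 4)), int(rng.integers(3))) for _ in range(20)]
w = np.zeros(4)
for _ in range(5):
    w = perceptron_epoch(w, data)
```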
Introduction | With this framework, adaptation to a new domain simply consists of updating a weight vector, and multiple domains can be supported by the same system.
Introduction | For each sentence that is being decoded, we choose the weight vector that is optimized on the closest cluster, allowing for adaptation even with unlabelled and heterogeneous test data. |
Translation Model Architecture | To combine statistics from a vector of n component corpora, we can use a weighted version of equation 1, which adds a weight vector λ of length n (Sennrich, 2012b):
Translation Model Architecture | Table 1: Illustration of instance weighting with weight vectors for two corpora. |
Translation Model Architecture | In our implementation, the weight vector is set globally, but can be overridden on a per-sentence basis. |
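As a concrete, hypothetical sketch of this kind of instance weighting (assuming the per-corpus counts are simply scaled by their entry in λ and summed before normalization, as described above):

```python
def weighted_phrase_prob(source, target, corpus_counts, lambdas):
    """p(target | source) from n component corpora, with per-corpus weights.

    corpus_counts[i] maps (source, target) phrase pairs to raw counts in corpus i;
    lambdas[i] is the weight assigned to corpus i.
    """
    num = sum(lam * counts.get((source, target), 0)
              for counts, lam in zip(corpus_counts, lambdas))
    den = sum(lam * c
              for counts, lam in zip(corpus_counts, lambdas)
              for (s, _t), c in counts.items() if s == source)
    return num / den if den > 0 else 0.0

# Two toy corpora; the second (e.g. in-domain) gets a higher weight.
counts_a = {("Haus", "house"): 8, ("Haus", "home"): 2}
counts_b = {("Haus", "home"): 5}
print(weighted_phrase_prob("Haus", "home", [counts_a, counts_b], [0.5, 2.0]))
```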
Training method | Input: Training set S = {(x_t, y_t)}_{t=1}^T; Output: Model weight vector w
Training method | where w is a weight vector and f is a feature representation of an input x and an output y. |
Training method | Learning a mapping between an input-output pair corresponds to finding a weight vector w such that the best scoring path of a given sentence is the same as (or close to) the correct path. |
Online Learning Algorithm | , D, surrogate weight vector v
Online Learning Algorithm | (a) Mapping a surrogate weight vector to a tensor X1 |
Online Learning Algorithm | Figure 2: Algorithm for mapping a surrogate weight vector X to a tensor. |
Tensor Model Construction | As a way out, we first run a simple vector-model based learning algorithm (say the Perceptron) on the training data and estimate a weight vector, which serves as a “surrogate” weight vector.
Tensor Space Representation | Most of the learning algorithms for NLP problems are based on vector space models, which represent data as vectors φ ∈ R^n, and try to learn feature weight vectors w ∈ R^n such that a linear model y = w · φ is able to discriminate between, say, good and bad hypotheses.
Experiments | • Delta-IDF: Takes the dot product of the Delta IDF weight vector (Formula 1) with the document’s term frequency vector.
Experiments | • Spread: Takes the dot product of the distribution spread weight vector (Formula 3) with the document’s term frequency vector.
Feature Weighting Methods | We calculate the Delta IDF score of every term in V, and get the Delta IDF weight vector Δ = (Δidf_1, ..., Δidf_|V|) for all terms.
Feature Weighting Methods | When the dataset is imbalanced, to avoid building a biased model, we down-sample the majority class before calculating the Delta IDF scores and then use a bias balancing procedure to balance the Delta IDF weight vector.
Feature Weighting Methods | This procedure first divides the Delta IDF weight vector into two vectors, one of which contains all the features with positive scores, and the other of which contains all the features with negative scores.
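A small illustrative sketch of Delta IDF weighting and the dot-product scoring mentioned above (the add-0.5 smoothing, log base, and sign orientation are assumptions, not necessarily the paper's Formula 1):

```python
import math
from collections import Counter

def delta_idf(pos_docs, neg_docs, vocab):
    """Per-term Delta IDF: terms frequent in positive documents and rare in
    negative ones get positive weight, and vice versa (add-0.5 smoothing)."""
    P, N = len(pos_docs), len(neg_docs)
    df_pos = Counter(t for d in pos_docs for t in set(d))
    df_neg = Counter(t for d in neg_docs for t in set(d))
    return {t: math.log2((df_pos[t] + 0.5) / (P + 0.5))
             - math.log2((df_neg[t] + 0.5) / (N + 0.5))
            for t in vocab}

def delta_idf_score(doc, weights):
    """Dot product of the Delta IDF weight vector with the document's
    term-frequency vector."""
    tf = Counter(doc)
    return sum(count * weights.get(t, 0.0) for t, count in tf.items())

pos = [["great", "plot"], ["great", "acting"]]
neg = [["boring", "plot"], ["boring", "slow"]]
vocab = {t for d in pos + neg for t in d}
weights = delta_idf(pos, neg, vocab)
print(delta_idf_score(["great", "boring", "plot"], weights))
```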
Structural SVMs | Let x be a document and wm a weight vector associated with the genre class m in a corpus with k genres at the most fine-grained level. |
Structural SVMs | The predicted class is the class achieving the maximum inner product between x and the weight vector for the class, denoted as, |
Structural SVMs | Accurate prediction requires that when a document vector is multiplied with the weight vector associated with its own class, the resulting inner product should be larger than its inner products with a weight vector for any other genre class m. This helps us to define criteria for weight vectors.
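A minimal sketch of that decision rule with hypothetical data (the predicted genre is simply the class whose weight vector yields the largest inner product with the document vector):

```python
import numpy as np

def predict_genre(x, W):
    """x: document vector of shape (d,); W: stacked class weight vectors of
    shape (k, d). Returns the class index m maximizing the inner product."""
    return int(np.argmax(W @ x))

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))   # k = 4 genre classes, d = 6 features
x = rng.normal(size=6)
print(predict_genre(x, W))
```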
Experiment | Table 1: Data, markedness matrix, weight vector, and joint log-probabilities for the IBPOT and the phonological standard constraints.
The IBPOT Model | The IBPOT model defines a generative process for mappings between input and output forms based on three latent variables: the constraint violation matrices F (faithfulness) and M (markedness), and the weight vector w. The cells of the violation matrices correspond to the number of violations of a constraint by a given input-output mapping. |
The IBPOT Model | The weight vector w provides weights for both F and M. Probabilities of output forms are given by a log-linear function:
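The log-linear function itself is not reproduced in this excerpt; a standard MaxEnt-grammar form consistent with the description (shown here only as a generic sketch, with c(x, y) stacking the violation counts from F and M) would be:

    P(y | x) = exp(w · c(x, y)) / Σ_{y'} exp(w · c(x, y')),

where the sign of the learned weights determines whether a violated constraint makes an output form more or less probable.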
The IBPOT Model | We initialize the model with a randomly-drawn markedness violation matrix M and weight vector w.
Algorithm | Finding the weight vector θ that minimizes the ℓ2-regularized average of this loss function is the structured support vector machine (SVM) problem (Taskar et al., 2003; Tsochantaridis et al., 2005):
Algorithm | Denote by θ^{t-1} the value of the weight vector before the t-th round.
Algorithm | Let Δφ_i = φ(ȳ_i, w_i) − φ(y_i, w_i). Then the algorithm updates the weight vector θ^t as follows:
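The update itself is not reproduced in this excerpt; a generic Pegasos-style subgradient step consistent with the surrounding description (regularization strength λ and step size 1/(λt) are assumptions) would be:

    θ^t = (1 − 1/t) θ^{t-1} + (1 / (λ t)) Δφ_i,

with the sign convention chosen so that the step shrinks the previous weight vector and then moves it toward the features of the correct structure and away from those of the prediction.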
Experiments | The single-threaded running time for PNDP+ and Pegasos/DP+ is about 40 minutes per epoch, measured on a dual-core AMD 2.4GHz CPU with 8GB of memory; for CRF, it takes about 100 minutes per epoch, which is almost entirely because the weight vector θ is less sparse with CRF learning.
Model | The conditional probability of an assignment α, given an input sequence x and the weight vector θ = (θ_1, …
Model | When performing inference, we wish to select the output sequence with the highest probability, given the input sequence x and the weight vector θ (i.e., MAP inference).
Model | A weight vector θ = (θ_1, …
Forest Reranking | As usual, we define the score of a parse y to be the dot product between a high dimensional feature representation and a weight vector w: |
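Written out, this is just the standard linear form (the feature function name f is generic notation, not necessarily the paper's):

    score(y) = w · f(y).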
Forest Reranking | Using a machine learning algorithm, the weight vector w can be estimated from the training data where each sentence s_i is labelled with its correct (“gold-standard”) parse. As for the learner, Collins (2000) uses the boosting algorithm and Charniak and Johnson (2005) use the maximum entropy estimator.
Forest Reranking | Now we train the reranker to pick the oracle parses as often as possible, and in case an error is made (line 6), perform an update on the weight vector (line 7) by adding the difference between the two feature representations.
Experiments | The initial weight vector was 0. |
Experiments | If not indicated otherwise, the perceptron was run for 10 epochs with learning rate η = 0.0001, started at the zero weight vector, using deduplicated 100-best lists.
Joint Feature Selection in Distributed Stochastic Learning | The mixed weight vector is re-sent to each shard to start another epoch of training in parallel on each shard.
Joint Feature Selection in Distributed Stochastic Learning | Reduced weight vectors are mixed and the result is re-sent to each shard to start another epoch of parallel training on each shard.
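A minimal sketch of this mix-and-resend loop (the shard-local training step is a placeholder, and uniform mixing weights are an assumption):

```python
import numpy as np

def train_on_shard(w, shard):
    """Placeholder for one epoch of shard-local training (e.g. SGD or
    perceptron updates) starting from the mixed weight vector w."""
    # ... shard-local updates on a copy of w would go here ...
    return w.copy()

def parameter_mixing(shards, dim, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        # Each shard trains in parallel from the current mixed weights.
        local = [train_on_shard(w, shard) for shard in shards]
        # Mix (here: a uniform average) and resend the result to every shard.
        w = np.mean(local, axis=0)
    return w

print(parameter_mixing(shards=[None, None, None], dim=4))
```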
Introduction | L1 regularization penalizes the weight vector for its L1-norm (i.e.
Log-Linear Models | In effect, it forces the weight to receive the total L1 penalty that would have been applied if the weight had been updated by the true gradients, assuming that the current weight vector resides in the same orthant as the true weight vector.
Log-Linear Models | problem as an L1-constrained problem (Lee et al., 2006), where the conditional log-likelihood of the training data is maximized under a fixed constraint on the L1-norm of the weight vector.
Log-Linear Models | (2008) describe efficient algorithms for projecting a weight vector onto the L1-ball.
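A simplified sketch of applying an L1 penalty during SGD by clipping weights at zero (this is the basic clipped penalty, not the full cumulative-penalty bookkeeping or the exact L1-ball projection cited above):

```python
import numpy as np

def sgd_step_with_l1(w, grad, lr=0.1, c=0.01):
    """One gradient-ascent step on the log-likelihood, followed by a clipped
    L1 penalty: each weight moves toward zero by at most lr * c, and is
    clipped so the penalty never flips its sign."""
    w = w + lr * grad
    penalty = lr * c
    return np.sign(w) * np.maximum(np.abs(w) - penalty, 0.0)

w = np.array([0.0005, -0.3, 0.8])
print(sgd_step_with_l1(w, np.zeros(3)))   # the near-zero weight is clipped to 0
```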
Paraphrasing for Web Search | λ̂_1^M = arg min_{λ_1^M} Σ_{i=1}^S Err(D_i^label, Q_i; λ_1^M). The objective of MERT is to find the optimal feature weight vector λ̂_1^M that minimizes the error criterion Err according to the NDCG scores of top-1 paraphrase candidates.
Paraphrasing for Web Search | where the best paraphrase candidate is selected according to the paraphrasing model based on the weight vector λ_1^M, and N(D_i^label, Q_i, R) is the NDCG score computed on the documents of Q_i ranked by R and the labeled document set D_i^label of Q_i.
Paraphrasing for Web Search | How to learn the weight vector λ is a standard learning-to-rank task.
Adaptive Online MT | A fixed threadpool of workers computes gradients in parallel and sends them to a master thread, which updates a central weight vector.
Adaptive Online MT | During a tuning run, the online method decodes the tuning set under many more weight vectors than a MERT-style batch method.
Experiments | Our algorithm decodes each example with a new weight vector, thus exploring more of the search space for the same tuning set.
Intervention Prediction Models | The model uses the pseudocode shown in Algorithm 1 to iteratively refine the weight vectors.
Intervention Prediction Models | Assuming that p represents the posts of thread t, h represents the latent category assignments, and r represents the intervention decision, a feature vector φ(p, r, h, t) is extracted for each thread, and using the weight vector w, this model defines a decision function similar to what is shown in Equation 1.
Intervention Prediction Models | w is the weight vector, ℓ is the squared hinge loss function, and f_w(t_j, p_j) is defined in Equation 1.
Collaborative Decoding | Let λ_m be the feature weight vector for member decoder d_m; the training procedure proceeds as follows:
Collaborative Decoding | For each decoder d_m, find a new feature weight vector λ'_m which optimizes the specified evaluation criterion L on D using the MERT algorithm, based on the n-best list J_m generated by d_m:
Collaborative Decoding | where T denotes the translations selected by re-ranking the translations in J_m using the new feature weight vector λ'_m.
Cross-Language Structural Correspondence Learning | to constrain the hypothesis space, i.e., the space of possible weight vectors, of the target task by considering multiple different but related prediction tasks.
Cross-Language Structural Correspondence Learning | The subspace is used to constrain the learning of the target task by restricting the weight vector w to lie in the subspace defined by θᵀ.
Cross-Language Text Classification | w is a weight vector that parameterizes the classifier, and ᵀ denotes the matrix transpose.
The summarization framework | We trained a Linear Regression classifier to learn the weight vector W = (w_1, w_2, w_3, w_4) that would combine the above features.
The summarization framework | It was calculated as the dot product between the learned weight vector W and the feature vector for the answer.
The summarization framework | In order to learn the weight vector V that would combine the above scores, we asked three human annotators to generate question-biased extractive summaries based on all answers available for a certain question. |
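A small sketch of learning such a combination vector by least squares (the feature scores and target relevance values below are hypothetical; the paper's actual training setup may differ):

```python
import numpy as np

# Each row: the four feature scores for one candidate answer;
# y: annotator-derived target relevance for that answer.
X = np.array([[0.2, 0.7, 0.1, 0.5],
              [0.9, 0.3, 0.4, 0.1],
              [0.5, 0.5, 0.6, 0.7]])
y = np.array([0.4, 0.8, 0.9])

# Learn W = (w1, w2, w3, w4) by ordinary least squares.
W, *_ = np.linalg.lstsq(X, y, rcond=None)

# Score a new answer as the dot product of W with its feature vector.
print(float(X[0] @ W))
```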
Introduction | distance between hypotheses when projected onto the line defined by the weight vector w. |
Learning in SMT | Given an input sentence in the source language x ∈ X, we want to produce a translation y ∈ Y(x) using a linear model parameterized by a weight vector w:
The Relative Margin Machine in SMT | More formally, the spread is the distance between y⁺ and the worst candidate (y^w, d^w) ← arg min_{(y,d) ∈ Y(x_i), D(x_i)} s(x_i, y, d), after projecting both onto the line defined by the weight vector w. For each y′, this projection is conveniently given by s(x_i, y′, d); thus the spread is calculated as δs(x_i, y⁺, y^w).
Multi-objective Algorithms | For each sentence pair (f, e) in the devset, we first generate an N-best list L = {h} using the current weight vector w (line 5).
Multi-objective Algorithms | Input: Devset, max number of iterations I. Output: A set of (Pareto-optimal) weight vectors. 1: Initialize w.
Theory of Pareto Optimality 2.1 Definitions and Concepts | Here, the MT system’s Decode function, parameterized by weight vector w, takes in a foreign sentence f and returns a translated hypothesis h. The argmax operates in vector space and our goal is to find w leading to hypotheses on the Pareto Frontier.
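For illustration, a minimal sketch of extracting the Pareto Frontier from a set of hypotheses scored on two metrics (higher is better for both; the scores are toy values, not from the paper):

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.
    A point q dominates p if q is >= p in every metric and differs from p."""
    frontier = []
    for p in points:
        dominated = any(all(qi >= pi for qi, pi in zip(q, p)) and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier

# Toy (metric-1, metric-2) score pairs for five hypotheses.
hyps = [(0.30, 0.70), (0.35, 0.65), (0.28, 0.72), (0.33, 0.71), (0.25, 0.60)]
print(pareto_frontier(hyps))
```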
Adding Linguistic Knowledge to the Monte-Carlo Framework | where y_i is the i-th hidden unit of ȳ, and ū_i is the weight vector corresponding to y_i.
Adding Linguistic Knowledge to the Monte-Carlo Framework | Q(s_t, a_t) = w̄ · f(s_t, a_t), where w̄ is the weight vector.
Monte-Carlo Framework for Computer Games | Here f(s, a) ∈ R^n is a real-valued feature function, and w̄ is a weight vector.
Empirical Analysis | In the training process of HL-flat, the algorithm relaxes the restriction of the HL-SOT algorithm that requires that the weight vector w_i of classifier i is only updated on examples that are positive for its parent node.
The HL-SOT Approach | Defining the f function. Let w_1, ..., w_N be weight vectors that define the linear-threshold classifiers of each node in the SOT.
The HL-SOT Approach | Formula 1 restricts the weight vector w_i of classifier i to be updated only on examples that are positive for its parent node.