A risk minimization framework for extractive summarization | Stated formally, a decision problem may consist of four basic elements: 1) an observation O from a random variable O, 2) a set of possible decisions (or actions) a ∈ A, 3) the state of nature θ ∈ Θ, and 4) a loss function L(a_i, θ) which specifies the cost associated with a chosen decision a_i, given that θ is the true state of nature. |
A risk minimization framework for extractive summarization | itself; (2) P(D|S_j) is the sentence generative probability that captures the degree of relevance of S_j to the residual document D; and (3) L(S_i, S_j) is the loss function that characterizes the relationship between sentence S_i and any other sentence S_j. |
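The four-element decision problem above can be illustrated with a minimal sketch of risk minimization: the chosen action is the one with the lowest expected loss under a distribution over states of nature. The state names, action names, and cost values below are hypothetical and only serve the illustration; they are not taken from the summarization framework itself.

```python
def expected_risk(action, posterior, loss):
    """Expected loss of `action` under a distribution over states of nature."""
    return sum(p * loss(action, state) for state, p in posterior.items())


def bayes_decision(actions, posterior, loss):
    """Return the action with minimum expected risk."""
    return min(actions, key=lambda a: expected_risk(a, posterior, loss))


# Hypothetical toy problem: two states of nature, two actions.
posterior = {"relevant": 0.3, "irrelevant": 0.7}
costs = {
    ("select", "relevant"): 0.0,
    ("select", "irrelevant"): 1.0,
    ("skip", "relevant"): 5.0,   # skipping a relevant sentence is costly
    ("skip", "irrelevant"): 0.0,
}


def loss(action, state):
    """Look up L(a, theta) in the hypothetical cost table."""
    return costs[(action, state)]


best = bayes_decision(["select", "skip"], posterior, loss)
```

Here selecting has expected risk 0.7 and skipping has expected risk 1.5, so the asymmetric costs favor selection even though "irrelevant" is the more probable state.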
Abstract | In addition, the introduction of various loss functions also provides the summarization framework with a flexible but systematic way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively. |
Proposed Methods | There are many ways to construct the above-mentioned three component models, i.e., the sentence generative model P(D|S_j), the sentence prior model P(S_j), and the loss function L(S_i, S_j). |
Proposed Methods | 4.3 Loss function |
Proposed Methods | The loss function introduced in the proposed summarization framework is to measure the relationship between any pair of sentences. |
Boosting-style algorithm | Second, the loss function we consider is the normalized Hamming loss over the substructures predictions, which does not match the multi-class losses for the variants of AdaBoost.2 Finally, the natural base hypotheses for this problem admit a structure that can be exploited to devise a more efficient solution, which of course was not part of the original considerations for the design of these variants of AdaBoost. |
Boosting-style algorithm | 2(Schapire and Singer, 1999) also present an algorithm using the Hamming loss for multi-class classification, but that is a Hamming loss over the set of classes and differs from the loss function relevant to our problem. |
Conclusion | A natural extension of this work consists of devising new algorithms and providing learning guarantees specific to other loss functions such as the edit-distance. |
Introduction | We will assume that the loss function considered admits an additive decomposition over the substructures, as is common in structured prediction. |
Learning scenario | The quality of the predictions is measured by a loss function L: Y × Y → R+ that can be decomposed as a sum of loss functions ℓ_k: Y_k × Y_k → R+ over the substructure sets Y_k, that is, for all y = (y^1, ..., y^l) ∈ Y with y^k ∈ Y_k, and y' = |
Learning scenario | We will assume in all that follows that the loss function L is bounded: L(y, y') ≤ M for all 1 This paper is a modified version of (Cortes et al., 2014a) |
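As a rough illustration of an additively decomposable and bounded loss, the sketch below sums a per-substructure loss over aligned positions and optionally normalizes by the number of substructures, which yields the normalized Hamming loss with bound M = 1. The function names and the tag sequences are assumptions made for the example.

```python
def hamming_substructure_loss(y_k, y_pred_k):
    """Per-substructure loss: 0 if the substructures agree, 1 otherwise."""
    return 0.0 if y_k == y_pred_k else 1.0


def additive_loss(y, y_pred, sub_loss=hamming_substructure_loss, normalize=True):
    """Sum of substructure losses, optionally divided by the number of substructures."""
    if len(y) != len(y_pred):
        raise ValueError("y and y_pred must have the same number of substructures")
    total = sum(sub_loss(a, b) for a, b in zip(y, y_pred))
    return total / len(y) if normalize else total


# Hypothetical tag sequences with one disagreement out of three positions.
gold = ["D", "N", "V"]
pred = ["D", "V", "V"]
value = additive_loss(gold, pred)
```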
Online learning approach | These include the algorithm of (Takimoto and Warmuth, 2003) denoted by WMWP, which is an extension of the (randomized) weighted-majority (WM) algorithm of (Littlestone and Warmuth, 1994) to more general bounded loss functions combined with the Weight Pushing (WP) algorithm of (Mohri, 1997); and the Follow the Perturbed Leader (FPL) algorithm of (Kalai and Vempala, 2005). |
Online learning approach | However, since the loss function is additive in the substructures and the updates are multiplicative, it suffices to maintain instead a weight w_t(e) per transition e, following the update |
Introduction | We describe k-best decoding for our hybrid model and design its loss function and the features appropriate for our task. |
Training method | Given a training example (x_t, y_t), MIRA tries to establish a margin between the score of the correct path s(x_t, y_t; w) and the score of the best candidate path s(x_t, ŷ; w) based on the current weight vector w that is proportional to a loss function L(y_t, ŷ). |
Training method | 4.3 Loss function |
Training method | We instead compute the loss function through false positives (FP) and false negatives (FN). |
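A loss built from false positives and false negatives can be sketched as follows. The excerpt does not say how the two counts are combined, so the weighted sum, the default weights, and the representation of predictions as sets of units are all assumptions of this example.

```python
def fp_fn_loss(predicted, gold, fp_weight=1.0, fn_weight=1.0):
    """Weighted count of false positives and false negatives (assumed combination)."""
    predicted, gold = set(predicted), set(gold)
    false_positives = len(predicted - gold)  # predicted but not in the gold set
    false_negatives = len(gold - predicted)  # in the gold set but not predicted
    return fp_weight * false_positives + fn_weight * false_negatives


# Hypothetical units: one false positive ("a") and two false negatives ("d", "e").
example_loss = fp_fn_loss({"a", "b", "c"}, {"b", "c", "d", "e"})
```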
Adaptive Online Algorithms | 1We specify the loss function for MT in section 3.1. |
Adaptive Online Algorithms | The relationship to SGD can be seen by linearizing the loss function ℓ_t(w) ≈ ℓ_t(w_{t-1}) + (w − w_{t-1})^T ∇ℓ_t(w_{t-1}) and taking the derivative of (6). |
Adaptive Online Algorithms | MIRA/AROW requires selecting the loss function ℓ so that w_t can be solved in closed-form, by a quadratic program (QP), or in some other way that is better than linearizing. |
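The link between linearization and SGD can be sketched under one assumption: that the objective referred to as (6), which is not shown in this excerpt, adds a proximity term (1/(2η))·||w − w_{t-1}||² to the loss. With the loss replaced by its linearization, setting the derivative to zero gives the familiar step w_t = w_{t-1} − η∇ℓ_t(w_{t-1}). The function names and numbers below are illustrative only.

```python
def linearized_proximal_objective(w, w_prev, grad, loss_prev, eta):
    """Linearized loss plus an assumed quadratic proximity term."""
    linear = loss_prev + sum((wi - wp) * g for wi, wp, g in zip(w, w_prev, grad))
    proximity = sum((wi - wp) ** 2 for wi, wp in zip(w, w_prev)) / (2.0 * eta)
    return linear + proximity


def sgd_step(w_prev, grad, eta):
    """Closed-form minimizer of the objective above: a plain gradient step."""
    return [wp - eta * g for wp, g in zip(w_prev, grad)]


# Hypothetical previous weights, gradient, loss value, and learning rate.
w_prev, grad, loss_prev, eta = [1.0, -2.0], [0.5, -1.0], 0.8, 0.1
w_new = sgd_step(w_prev, grad, eta)
```

Evaluating the objective at small perturbations of `w_new` gives larger values, which is consistent with the gradient step being its minimizer.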
Adaptive Online MT | AdaGrad (lines 9-10) is a crucial piece, but the loss function, regularization technique, and parallelization strategy described in this section are equally important in the MT setting. |
Adaptive Online MT | 3.1 Pairwise Logistic Loss Function |
Adaptive Online MT | The pairwise approach results in simple, convex loss functions suitable for online learning. |
Experimental Results | These positive results are somewhat surprising since a very simple loss function was used on |
Introduction | Instead of devising various techniques for coping with non-convex loss functions, we approach the problem from a different perspective. |
Introduction | Although using a least squares loss function for classification appears misguided, there is a precedent for just this approach in the early pattern recognition literature (Duda et al., 2000). |
Introduction | This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data. |
Semi-supervised Structured Large Margin Objective | The resulting loss function has a hat shape (usually called hat-loss), which is non-convex. |
Structured Learning | We will perform discriminative training using a loss function that directly measures end-to-end summarization quality. |
Structured Learning | We use bigram recall as our loss function (see Section 3.3). |
Structured Learning | Luckily, our choice of loss function, bigram recall, factors over bigrams. |
Experiments | However, we found no simple way to change the relative performance characteristics of our various systems; notably, modifying the parameters of the loss function mentioned in Section 4 or changing it entirely did not trade off these three metrics but merely increased or decreased them in lockstep. |
Learning | This optimizes for the 0-1 loss; however, we are much more interested in optimizing with respect to a coreference-specific loss function. |
Learning | We modify Equation 1 to use a new probability distribution P' instead of P, where P'(a|x_i) ∝ P(a|x_i) exp(l(a, C)) and l(a, C) is a loss function. |
Learning | In order to perform inference efficiently, l(a, C) must decompose linearly across mentions: l(a, C) = Σ_{i=1}^{n} l(a_i, C). Commonly-used coreference metrics such as MUC (Vilain et al., 1995) and B3 (Bagga and Baldwin, 1998) do not have this property, so we instead make use of a parameterized loss function that does and fit the parameters to give good performance. |
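The loss-augmented distribution P'(a|x_i) ∝ P(a|x_i) exp(l(a, C)) can be sketched for a single mention as a reweighting followed by renormalization. The candidate probabilities and loss values below are hypothetical, and the function name is an assumption of the example.

```python
import math


def loss_augmented_distribution(probs, losses):
    """Reweight each candidate probability by exp(loss) and renormalize."""
    weights = [p * math.exp(l) for p, l in zip(probs, losses)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Two hypothetical antecedent candidates; the second one incurs a loss of 1.
augmented = loss_augmented_distribution([0.5, 0.5], [0.0, 1.0])
```

Candidates with higher loss receive more probability mass under P', so training against P' pushes the model harder away from costly errors.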
Related Work | Our BASIC model is a mention-ranking approach resembling models used by Denis and Baldridge (2008) and Rahman and Ng (2009), though it is trained using a novel parameterized loss function. |
Abstract | sentences or tweets) in their loss functions. |
Introduction | sentences or tweets) in their loss functions. |
Related Work | The loss function of SSWEu is the linear combination of two hinge losses, |
Related Work | We learn sentiment-specific word embedding (SSWE) by integrating the sentiment information into the loss functions of three neural networks. |
Autoencoders for Grounded Semantics | where L is a loss function, such as cross-entropy. |
Autoencoders for Grounded Semantics | The reconstruction error for an input x^(i) with loss function L then is: |
Autoencoders for Grounded Semantics | Unimodal Autoencoders For both modalities, we use the hyperbolic tangent function as activation function for encoder f_θ and decoder g_θ' and an entropic loss function for L. The weights of each autoencoder are tied, i.e., W' = W^T. |
Model Training | The loss function is defined as follows so as to minimize the information lost: |
Model Training | The loss function is the commonly used ranking loss with a margin, and it is defined as follows: |
Model Training | The loss function aims to learn a model which assigns the good translation candidate (the oracle candidate) higher score than the bad ones, with a margin 1. |
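A margin-based ranking loss of the kind described above can be sketched as a hinge on the score difference between the oracle candidate and each other candidate. The excerpt does not reproduce the formula, so the hinge form and the summation over candidates are assumptions; only the margin of 1 comes from the text.

```python
def margin_ranking_loss(oracle_score, candidate_scores, margin=1.0):
    """Sum of hinge penalties for candidates scored within `margin` of the oracle."""
    return sum(max(0.0, margin - (oracle_score - s)) for s in candidate_scores)


# Hypothetical scores: the first candidate is beaten by more than the margin,
# the second is only 0.5 below the oracle and is penalized by 0.5.
ranking_loss = margin_ranking_loss(3.0, [1.0, 2.5])
```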
Expected BLEU Training | Next, we fix λ, set λ_{M+1} = 1 and optimize θ with respect to the loss function on the training data using stochastic gradient descent (SGD).1 |
Expected BLEU Training | Formally, we define our loss function l(θ) as the negative expected BLEU score, denoted as xBLEU(θ) for a given foreign sentence f: |
Expected BLEU Training | Next, we define the gradient of the expected BLEU loss function l(θ) using the observation that the loss does not explicitly depend on θ: |
Introduction | In addition, we use a non-symmetric loss function during optimization to account for the imbalance between over-predicting and under-predicting the beam-width. |
Open/Closed Cell Classification | where H is the unit step function: 1 if the inner product θ · x > 0, and 0 otherwise; and L_λ(·, ·) is an asymmetric loss function, defined below. |
Open/Closed Cell Classification | To deal with this imbalance, we introduce an asymmetric loss function L_λ(·, ·) to penalize false-negatives more severely during training. |
Open/Closed Cell Classification | For Constituent and Complete Closure, we also vary the loss function, adjusting the relative penalty between a false-negative (closing off a chart cell that contains a maximum likelihood edge) and a false-positive. |
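One way to picture an asymmetric loss that penalizes false negatives more heavily is the sketch below. The definition of L_λ is not included in this excerpt, so the specific costs (λ for a false negative, 1 for a false positive, 0 for a correct prediction) and the binary label encoding are assumptions.

```python
def asymmetric_loss(predicted, gold, lam=2.0):
    """Assumed asymmetric 0/1/lambda loss over binary labels (1 = open, 0 = closed)."""
    if predicted == gold:
        return 0.0
    if gold == 1 and predicted == 0:
        return lam  # false negative: closing a cell that should stay open
    return 1.0      # false positive: leaving open a cell that could be closed


# Hypothetical predictions and gold labels for four chart cells.
total_loss = sum(asymmetric_loss(p, g) for p, g in [(1, 1), (0, 1), (1, 0), (0, 0)])
```

Raising `lam` shifts the learned decision boundary toward keeping cells open, trading speed for fewer search errors.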
Experiments | The loss function in our evaluation is the standard Mean Square Error (MSE), but to allow a better interpretation of the results, we display its root (RMSE) in tables and figures.6 |
Methods | 2 is the standard regularisation loss function, namely the sum squared error over the training instances.4 |
Methods | Biconvex functions and possible applications have been well studied in the optimisation literature (Quesada and Grossmann, 1995; 4Note that other loss functions could be used here, such as logistic loss for classification, or more generally bilinear |
Deceptive Answer Prediction with User Preference Graph | where L(w^T x_i, y_i) is a loss function that measures discrepancy between the predicted label w^T · x_i and the true label y_i, where y_i ∈ {+1, −1}. |
Deceptive Answer Prediction with User Preference Graph | The commonly used loss functions include L(p, y) = (p − y)^2 (least square) and L(p, y) = ln(1 + exp(−py)) (logistic regression). |
Deceptive Answer Prediction with User Preference Graph | For simplicity, here we use the least square loss function. |
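The two losses named above translate directly into code. This is a minimal sketch of the formulas as stated, with p the predicted score and y a label in {+1, −1}; the function names and the sample values are choices made for the example.

```python
import math


def least_square_loss(p, y):
    """L(p, y) = (p - y)^2."""
    return (p - y) ** 2


def logistic_loss(p, y):
    """L(p, y) = ln(1 + exp(-p * y))."""
    return math.log(1.0 + math.exp(-p * y))


# Hypothetical prediction 0.5 for a positive example.
ls_value = least_square_loss(0.5, 1)
log_value = logistic_loss(0.5, 1)
```

Both losses shrink as the prediction moves toward the correct label, but the least square loss also penalizes confident correct predictions with |p| > 1, whereas the logistic loss keeps decreasing.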
Related Work | Instance weighting is a method for domain adaptation in which instance-dependent weights are assigned to the loss function that is minimized during the training process. |
Related Work | Let l(x, y, θ) be some loss function. |
Related Work | Then, as shown in Jiang and Zhai (2007), the loss function can be weighted by β_i l(x, y, θ), such that β_i = where P_s and P_t are the source and target distributions, respectively. |
Learning Representation for Contextual Document | The loss function is defined as negative log of softmax function: |
Learning Representation for Contextual Document | The loss function is closely related to contrastive estimation (Smith and Eisner, 2005), which defines where the positive example takes probability mass from. |
Learning Representation for Contextual Document | In our experiments, the softmax loss function consistently outperforms pairwise ranking loss function, which is taken as our default setting. |
Related Work | Durrett and Klein (2013) present a coreference resolver with latent antecedents that predicts clusterings over entire documents and fit a log-linear model with a custom task-specific loss function using AdaGrad (Duchi et al., 2011). |
Representation and Learning | A loss function LOSS that quantifies the error in the prediction is used to compute a scalar τ that controls how much the weights are moved in each update. |
Representation and Learning | The loss function can be an arbitrarily complex function that returns a numerical value of how bad the prediction is. |
Cross-Language Text Classification | L is a loss function that measures the quality of the classifier, λ is a nonnegative regularization parameter that penalizes model complexity, and ||w||^2 = w^T w. |
Cross-Language Text Classification | Different choices for L entail different classifier types; e.g., when choosing the hinge loss function for L one obtains the popular Support Vector Machine classifier (Zhang, 2004). |
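A regularized objective with a pluggable loss L can be sketched as below. The exact objective is not reproduced in this excerpt, so the form "sum of losses plus (λ/2)·||w||²" is an assumption, as are the function names and the toy data; choosing the hinge loss for L gives a linear SVM-style objective.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * score)


def regularized_objective(w, data, lam, loss=hinge_loss):
    """Assumed objective: sum of per-example losses plus (lam / 2) * ||w||^2."""
    empirical = sum(
        loss(y, sum(wi * xi for wi, xi in zip(w, x))) for x, y in data
    )
    squared_norm = sum(wi * wi for wi in w)  # ||w||^2 = w^T w
    return empirical + 0.5 * lam * squared_norm


# Hypothetical two-example training set and weight vector.
data = [([2.0, 0.0], 1), ([0.5, 0.0], -1)]
objective = regularized_objective([1.0, 0.0], data, lam=0.1)
```

The first example is classified with margin 2 and contributes no loss; the second is misclassified and contributes 1.5, to which the regularizer adds 0.05.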
Experiments | In particular, the learning rate schedule from PEGASOS is adopted (Shalev-Shwartz et al., 2007), and the modified Huber loss, introduced by Zhang (2004), is chosen as loss function L.3 |
Variational vs. Min-Risk Decoding | They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case, |
Variational vs. Min-Risk Decoding | With the above loss function, Tromble et al. |
Variational vs. Min-Risk Decoding | 15 The MBR becomes the MAP decision rule of (1) if a so-called zero-one loss function is used: l(y, y') = 0 if y = y'; otherwise l(y, y') = 1. |
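The reduction of MBR to MAP under the zero-one loss can be checked with a small sketch. Under this loss the expected risk of a candidate y' is 1 − p(y'), so minimizing risk is the same as maximizing posterior probability. The hypothesis strings and probabilities are hypothetical, and the candidate set is assumed to coincide with the support of the posterior.

```python
def zero_one_loss(y, y_prime):
    """l(y, y') = 0 if y == y', otherwise 1."""
    return 0.0 if y == y_prime else 1.0


def mbr_decode(posterior, loss):
    """Return the candidate with minimum expected loss under the posterior."""
    def risk(candidate):
        return sum(p * loss(y, candidate) for y, p in posterior.items())
    return min(posterior, key=risk)


# Hypothetical posterior over three translation hypotheses.
posterior = {"the cat sat": 0.5, "a cat sat": 0.3, "the cat sits": 0.2}
mbr_choice = mbr_decode(posterior, zero_one_loss)
map_choice = max(posterior, key=posterior.get)
```

Swapping `zero_one_loss` for a loss that gives partial credit to similar hypotheses is what allows MBR to select something other than the single most probable output.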
Discussion | In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set. |
Experiments | Note that N -best MBR uses a sentence BLEU loss function . |
Minimum Bayes-Risk Decoding | This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate. |