Index of papers in Proc. ACL that mention
  • loss function
Lin, Shih-Hsiang and Chen, Berlin
A risk minimization framework for extractive summarization
Stated formally, a decision problem may consist of four basic elements: 1) an observation o from a random variable O, 2) a set of possible decisions (or actions) a ∈ A, 3) the state of nature θ, and 4) a loss function L(a_i, θ) which specifies the cost associated with a chosen decision a_i, given that θ is the true state of nature.
A risk minimization framework for extractive summarization
itself; (2) P(D|S_j) is the sentence generative probability that captures the degree of relevance of S_j to the residual document D; and (3) L(S_i, S_j) is the loss function that characterizes the relationship between sentence S_i and any other sentence S_j.
Abstract
In addition, the introduction of various loss functions also provides the summarization framework with a flexible but systematic way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively.
Proposed Methods
There are many ways to construct the three component models mentioned above, i.e., the sentence generative model P(D|S_j), the sentence prior model P(S_j), and the loss function L(S_i, S_j).
Proposed Methods
4.3 Loss function
Proposed Methods
The loss function introduced in the proposed summarization framework measures the relationship between any pair of sentences.
loss function is mentioned in 14 sentences in this paper.
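To make the risk-minimization recipe in the excerpts above concrete, here is a minimal Python sketch of selecting the sentence with the lowest expected loss; the posterior P(S_j | D) is assumed given, and the cosine-based loss is an illustrative stand-in, not the paper's actual L(S_i, S_j).

    # Illustrative sketch (not the paper's exact model): pick the sentence with
    # minimum expected loss R(S_i) = sum_j P(S_j | D) * L(S_i, S_j).
    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        num = sum(a[t] * b[t] for t in a)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def select_sentence(sentences, posterior):
        """sentences: list of token lists; posterior: P(S_j | D), summing to 1."""
        bags = [Counter(s) for s in sentences]
        # Hypothetical loss: 1 - cosine similarity between bag-of-words vectors.
        loss = lambda i, j: 1.0 - cosine(bags[i], bags[j])
        risks = [sum(p * loss(i, j) for j, p in enumerate(posterior)) for i in range(len(sentences))]
        return min(range(len(sentences)), key=risks.__getitem__)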
Cortes, Corinna and Kuznetsov, Vitaly and Mohri, Mehryar
Boosting-style algorithm
Second, the loss function we consider is the normalized Hamming loss over the substructure predictions, which does not match the multi-class losses for the variants of AdaBoost.2 Finally, the natural base hypotheses for this problem admit a structure that can be exploited to devise a more efficient solution, which of course was not part of the original considerations for the design of these variants of AdaBoost.
Boosting-style algorithm
Footnote 2: (Schapire and Singer, 1999) also present an algorithm using the Hamming loss for multi-class classification, but that is a Hamming loss over the set of classes and differs from the loss function relevant to our problem.
Conclusion
A natural extension of this work consists of devising new algorithms and providing learning guarantees specific to other loss functions such as the edit-distance.
Introduction
We will assume that the loss function considered admits an additive decomposition over the substructures, as is common in structured prediction.
Learning scenario
The quality of the predictions is measured by a loss function L: Y × Y → R+ that can be decomposed as a sum of loss functions ℓ_k: Y_k × Y_k → R+ over the substructure sets Y_k, that is, for all y = (y^1, ..., y^l) ∈ Y with y^k ∈ Y_k and y' = (y'^1, ..., y'^l) ∈ Y: L(y, y') = Σ_{k=1}^{l} ℓ_k(y^k, y'^k).
Learning scenario
We will assume in all that follows that the loss function L is bounded: L(y, y') ≤ M for all y, y'. (Footnote 1: This paper is a modified version of (Cortes et al., 2014a).)
Online learning approach
These include the algorithm of (Takimoto and Warmuth, 2003) denoted by WMWP, which is an extension of the (randomized) weighted-majority (WM) algorithm of (Littlestone and Warmuth, 1994) to more general bounded loss functions combined with the Weight Pushing (WP) algorithm of (Mohri, 1997); and the Follow the Perturbed Leader (FPL) algorithm of (Kalai and Vempala, 2005).
Online learning approach
However, since the loss function is additive in the substructures and the updates are multiplicative, it suffices to maintain instead a weight w_t(e) per transition e, following the update
loss function is mentioned in 10 sentences in this paper.
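As a small illustration of the additively decomposable loss described above, the following sketch computes the normalized Hamming loss over substructure predictions (here, positions in a label sequence); it is a generic example, not code from the paper.

    # Minimal sketch of an additively decomposable loss: the normalized Hamming
    # loss over l substructures, L(y, y') = (1/l) * sum_k 1[y^k != y'^k].
    def hamming_loss(y, y_prime):
        assert len(y) == len(y_prime)
        return sum(yk != ypk for yk, ypk in zip(y, y_prime)) / len(y)

    # Example: two label sequences of length 4 that disagree on one position.
    print(hamming_loss(["N", "V", "D", "N"], ["N", "V", "D", "V"]))  # 0.25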
Kruengkrai, Canasai and Uchimoto, Kiyotaka and Kazama, Jun'ichi and Wang, Yiou and Torisawa, Kentaro and Isahara, Hitoshi
Introduction
We describe k-best decoding for our hybrid model and design its loss function and the features appropriate for our task.
Training method
Given a training example (x_t, y_t), MIRA tries to establish a margin between the score of the correct path s(x_t, y_t; w) and the score of the best candidate path s(x_t, ŷ; w) under the current weight vector w, where the margin is proportional to a loss function L(y_t, ŷ).
Training method
4.3 Loss function
Training method
We instead compute the loss function through false positives (FP) and false negatives (FN).
loss function is mentioned in 7 sentences in this paper.
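A minimal sketch of a loss computed from false positives and false negatives, in the spirit of the excerpt above; the span representation and the unweighted FP + FN sum are assumptions made for illustration.

    # Sketch (assumed formulation): a structured loss computed from false
    # positives and false negatives of predicted units, L(y, y_hat) = FP + FN.
    def fp_fn_loss(gold_spans, pred_spans):
        gold, pred = set(gold_spans), set(pred_spans)
        fp = len(pred - gold)   # predicted units not in the gold standard
        fn = len(gold - pred)   # gold units the prediction missed
        return fp + fn

    # Example with word-segmentation-style (start, end) spans.
    print(fp_fn_loss({(0, 2), (2, 5)}, {(0, 2), (2, 4), (4, 5)}))  # 1 FN + 2 FP = 3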
Green, Spence and Wang, Sida and Cer, Daniel and Manning, Christopher D.
Adaptive Online Algorithms
Footnote 1: We specify the loss function for MT in section 3.1.
Adaptive Online Algorithms
The relationship to SGD can be seen by linearizing the loss function ℓ_t(w) ≈ ℓ_t(w_{t-1}) + (w − w_{t-1})^T ∇ℓ_t(w_{t-1}) and taking the derivative of (6).
Adaptive Online Algorithms
MIRA/AROW requires selecting the loss function ℓ so that w_t can be solved for in closed form, by a quadratic program (QP), or in some other way that is better than linearizing.
Adaptive Online MT
AdaGrad (lines 9-10) is a crucial piece, but the loss function, regularization technique, and parallelization strategy described in this section are equally important in the MT setting.
Adaptive Online MT
3.1 Pairwise Logistic Loss Function
Adaptive Online MT
The pairwise approach results in simple, convex loss functions suitable for online learning.
loss function is mentioned in 7 sentences in this paper.
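For reference, a hedged sketch of a pairwise logistic loss of the kind named above: given a hypothesis pair in which h_plus should outrank h_minus, the loss is convex in the weights and has a simple gradient. The feature vectors and function names are illustrative.

    # Sketch: pairwise logistic loss log(1 + exp(-w . (f(h_plus) - f(h_minus)))),
    # which is convex in w, plus its gradient.
    import numpy as np

    def pairwise_logistic_loss(w, feats_plus, feats_minus):
        diff = feats_plus - feats_minus
        return np.log1p(np.exp(-np.dot(w, diff)))

    def gradient(w, feats_plus, feats_minus):
        diff = feats_plus - feats_minus
        return -diff / (1.0 + np.exp(np.dot(w, diff)))

    w = np.zeros(3)
    print(pairwise_logistic_loss(w, np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])))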
Wang, Qin Iris and Schuurmans, Dale and Lin, Dekang
Experimental Results
These positive results are somewhat surprising since a very simple loss function was used on
Introduction
Instead of devising various techniques for coping with non-convex loss functions, we approach the problem from a different perspective.
Introduction
Although using a least squares loss function for classification appears misguided, there is a precedent for just this approach in the early pattern recognition literature (Duda et al., 2000).
Introduction
This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data.
Semi-supervised Structured Large Margin Objective
The resulting loss function has a hat shape (usually called hat-loss), which is non-convex.
loss function is mentioned in 5 sentences in this paper.
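A rough sketch of the convex combination described above: a structured hinge loss on labeled examples plus a least squares loss on unlabeled examples. All function names, the margin of 1, and the trade-off parameter lam are assumptions for illustration.

    # Sketch: convex semi-supervised objective = structured hinge on labeled data
    # + least squares on unlabeled data (names and margin are illustrative).
    import numpy as np

    def structured_hinge(w, feats_gold, feats_best_wrong, margin=1.0):
        # max(0, margin - (w.f(gold) - w.f(best wrong)))
        return max(0.0, margin - (np.dot(w, feats_gold) - np.dot(w, feats_best_wrong)))

    def least_squares(w, feats, target):
        return 0.5 * (np.dot(w, feats) - target) ** 2

    def semi_supervised_objective(w, labeled, unlabeled, lam=1.0):
        # labeled: (feats_gold, feats_best_wrong) pairs; unlabeled: (feats, target) pairs
        return (sum(structured_hinge(w, g, b) for g, b in labeled)
                + lam * sum(least_squares(w, f, t) for f, t in unlabeled))

    w = np.zeros(2)
    print(semi_supervised_objective(w, [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))],
                                    [(np.array([1.0, 1.0]), 0.5)]))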
Berg-Kirkpatrick, Taylor and Gillick, Dan and Klein, Dan
Structured Learning
We will perform discriminative training using a loss function that directly measures end-to-end summarization quality.
Structured Learning
We use bigram recall as our loss function (see Section 3.3).
Structured Learning
Luckily, our choice of loss function, bigram recall, factors over bigrams.
loss function is mentioned in 5 sentences in this paper.
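Since the loss in question is bigram recall, a small sketch of that quantity (and the corresponding loss, one minus recall) may help; the tokenization and variable names are illustrative.

    # Sketch: bigram recall of a system summary against a reference, and the
    # corresponding loss (1 - recall), which factors over bigrams.
    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))

    def bigram_recall_loss(reference_tokens, summary_tokens):
        ref, hyp = bigrams(reference_tokens), bigrams(summary_tokens)
        recall = len(ref & hyp) / len(ref) if ref else 0.0
        return 1.0 - recall

    print(bigram_recall_loss("the cat sat on the mat".split(), "the cat sat".split()))  # 0.6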
Durrett, Greg and Hall, David and Klein, Dan
Experiments
However, we found no simple way to change the relative performance characteristics of our various systems; notably, modifying the parameters of the loss function mentioned in Section 4 or changing it entirely did not trade off these three metrics but merely increased or decreased them in lockstep.
Learning
This optimizes for the 0-1 loss; however, we are much more interested in optimizing with respect to a coreference-specific loss function.
Learning
We modify Equation 1 to use a new probability distribution P' instead of P, where P'(a|x_i) ∝ P(a|x_i) exp(l(a, C)) and l(a, C) is a loss function.
Learning
In order to perform inference efficiently, l(a, C) must decompose linearly across mentions: l(a, C) = Σ_{i=1}^{n} l(a_i, C). Commonly-used coreference metrics such as MUC (Vilain et al., 1995) and B³ (Bagga and Baldwin, 1998) do not have this property, so we instead make use of a parameterized loss function that does and fit the parameters to give good performance.
Related Work
Our BASIC model is a mention-ranking approach resembling models used by Denis and Baldridge (2008) and Rahman and Ng (2009), though it is trained using a novel parameterized loss function.
loss function is mentioned in 5 sentences in this paper.
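A brief sketch of the loss-augmented distribution P'(a | x_i) ∝ P(a | x_i) exp(l(a, C)) quoted above, computed in log space for numerical stability; the scores and losses passed in are placeholders.

    # Sketch: renormalize after adding the loss term in log space,
    # P'(a | x_i) proportional to P(a | x_i) * exp(l(a, C)).
    import numpy as np

    def loss_augment(scores, losses):
        """scores: log P(a | x_i) up to a constant; losses: l(a, C) per antecedent."""
        logits = np.asarray(scores) + np.asarray(losses)   # add loss in log space
        logits -= logits.max()                             # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()

    print(loss_augment([2.0, 1.0, 0.5], [0.0, 1.0, 1.0]))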
Tang, Duyu and Wei, Furu and Yang, Nan and Zhou, Ming and Liu, Ting and Qin, Bing
Abstract
sentences or tweets) in their loss functions.
Introduction
sentences or tweets) in their loss functions.
Related Work
The loss function of SSWEu is the linear combination of two hinge losses,
Related Work
We learn sentiment-specific word embedding (SSWE) by integrating the sentiment information into the loss functions of three neural networks.
loss function is mentioned in 4 sentences in this paper.
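As an illustration of a loss built from a linear combination of two hinge losses, here is a hedged sketch; the weighting scheme, margin, and score names are assumptions rather than the paper's exact SSWEu formulation.

    # Sketch: loss = alpha * hinge_context + (1 - alpha) * hinge_sentiment,
    # a linear combination of two hinge losses (names assumed).
    def hinge(score_true, score_corrupt, margin=1.0):
        return max(0.0, margin - score_true + score_corrupt)

    def sswe_style_loss(ctx_true, ctx_corrupt, sent_true, sent_corrupt, alpha=0.5):
        return alpha * hinge(ctx_true, ctx_corrupt) + (1.0 - alpha) * hinge(sent_true, sent_corrupt)

    print(sswe_style_loss(0.8, 0.2, 0.4, 0.9))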
Silberer, Carina and Lapata, Mirella
Autoencoders for Grounded Semantics
where L is a loss function, such as cross-entropy.
Autoencoders for Grounded Semantics
The reconstruction error for an input x^(i) with loss function L then is:
Autoencoders for Grounded Semantics
Unimodal Autoencoders: For both modalities, we use the hyperbolic tangent function as the activation function for encoder f_θ and decoder g_{θ'}, and an entropic loss function for L. The weights of each autoencoder are tied, i.e., W' = W^T.
loss function is mentioned in 4 sentences in this paper.
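A minimal sketch of a tied-weight autoencoder with a tanh encoder and an entropic (cross-entropy) reconstruction loss, as described above; the sigmoid decoder output and the scaling of inputs to (0, 1) are assumptions made here so the cross-entropy is well defined.

    # Sketch: tied-weight autoencoder (W' = W^T), tanh encoder, cross-entropy
    # reconstruction loss on inputs assumed to lie in (0, 1).
    import numpy as np

    def autoencoder_loss(x, W, b_enc, b_dec, eps=1e-9):
        h = np.tanh(W @ x + b_enc)                          # encoder f_theta
        x_hat = 1.0 / (1.0 + np.exp(-(W.T @ h + b_dec)))    # tied decoder (assumed sigmoid output)
        # entropic (cross-entropy) reconstruction loss
        return -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=8)
    W = rng.normal(scale=0.1, size=(4, 8))
    print(autoencoder_loss(x, W, np.zeros(4), np.zeros(8)))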
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Model Training
The loss function is defined as follows, so as to minimize the information lost:
Model Training
The loss function is the commonly used ranking loss with a margin, and it is defined as follows:
Model Training
The loss function aims to learn a model which assigns the good translation candidate (the oracle candidate) a higher score than the bad ones, with a margin of 1.
loss function is mentioned in 4 sentences in this paper.
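To illustrate the margin-1 ranking loss described above, a short sketch in which the oracle candidate is required to outscore every other candidate by at least 1; the function and argument names are illustrative.

    # Sketch: margin ranking loss; the oracle should score at least `margin`
    # higher than each competing candidate.
    def ranking_loss(oracle_score, candidate_scores, margin=1.0):
        return sum(max(0.0, margin - oracle_score + s) for s in candidate_scores)

    print(ranking_loss(2.0, [1.5, 0.2, 2.3]))  # only the 1.5 and 2.3 candidates contribute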
Auli, Michael and Gao, Jianfeng
Expected BLEU Training
Next, we fix λ, set λ_{M+1} = 1 and optimize θ with respect to the loss function on the training data using stochastic gradient descent (SGD).1
Expected BLEU Training
Formally, we define our loss function l(θ) as the negative expected BLEU score, denoted as xBLEU(θ) for a given foreign sentence f:
Expected BLEU Training
Next, we define the gradient of the expected BLEU loss function l(θ) using the observation that the loss does not explicitly depend on θ:
loss function is mentioned in 4 sentences in this paper.
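A hedged sketch of a negative expected sentence-level BLEU loss over an n-best list, following the description above; sentence_bleus stands in for whichever sentence-level BLEU approximation is used, and the softmax over model scores is an assumption.

    # Sketch: l(theta) = - sum_e p_theta(e | f) * score(e) over an n-best list.
    import math

    def expected_loss(model_scores, sentence_bleus):
        """model_scores: unnormalized log-scores of n-best hypotheses."""
        m = max(model_scores)
        weights = [math.exp(s - m) for s in model_scores]
        z = sum(weights)
        posterior = [w / z for w in weights]
        return -sum(p * b for p, b in zip(posterior, sentence_bleus))

    print(expected_loss([1.2, 0.3, -0.5], [0.42, 0.35, 0.10]))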
Bodenstab, Nathan and Dunlop, Aaron and Hall, Keith and Roark, Brian
Introduction
In addition, we use a non-symmetric loss function during optimization to account for the imbalance between over-predicting and under-predicting the beam-width.
Open/Closed Cell Classification
where H is the unit step function: 1 if the inner product θ · x > 0, and 0 otherwise; and L_λ(·,·) is an asymmetric loss function, defined below.
Open/Closed Cell Classification
To deal with this imbalance, we introduce an asymmetric loss function L_λ(·,·) to penalize false-negatives more severely during training.
Open/Closed Cell Classification
For Constituent and Complete Closure, we also vary the loss function, adjusting the relative penalty between a false-negative (closing off a chart cell that contains a maximum likelihood edge) and a false-positive.
loss function is mentioned in 4 sentences in this paper.
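As a toy version of the asymmetric loss L_λ described above, the sketch below charges a false negative λ times as much as a false positive; the specific penalty values are illustrative.

    # Sketch: asymmetric loss in which false negatives (closing a needed cell)
    # cost lambda times more than false positives.
    def asymmetric_loss(y_true, y_pred, lam=10.0):
        # y_true, y_pred in {0, 1}; 1 = cell open
        if y_true == y_pred:
            return 0.0
        return lam if (y_true == 1 and y_pred == 0) else 1.0   # FN vs FP

    print(asymmetric_loss(1, 0), asymmetric_loss(0, 1))  # 10.0 1.0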
Lampos, Vasileios and Preoţiuc-Pietro, Daniel and Cohn, Trevor
Experiments
The loss function in our evaluation is the standard Mean Square Error (MSE), but to allow a better interpretation of the results, we display its root (RMSE) in tables and figures.6
Methods
2 is the standard regularisation loss function, namely the sum squared error over the training instances.4
Methods
Biconvex functions and possible applications have been well studied in the optimisation literature (Quesada and Grossmann, 1995). Footnote 4: Note that other loss functions could be used here, such as logistic loss for classification, or more generally bilinear
loss function is mentioned in 3 sentences in this paper.
Li, Fangtao and Gao, Yang and Zhou, Shuchang and Si, Xiance and Dai, Decheng
Deceptive Answer Prediction with User Preference Graph
where L(w^T x_i, y_i) is a loss function that measures the discrepancy between the predicted label w^T x_i and the true label y_i, where y_i ∈ {+1, −1}.
Deceptive Answer Prediction with User Preference Graph
The commonly used loss functions include L(p, y) = (p − y)² (least square) and L(p, y) = ln(1 + exp(−py)) (logistic regression).
Deceptive Answer Prediction with User Preference Graph
For simplicity, here we use the least square loss function.
loss function is mentioned in 3 sentences in this paper.
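The two losses quoted above, written out as code for a prediction p = w·x and a label y ∈ {+1, −1}:

    # The squared loss and the logistic loss from the excerpt above.
    import math

    def squared_loss(p, y):
        return (p - y) ** 2

    def logistic_loss(p, y):
        return math.log(1.0 + math.exp(-p * y))

    print(squared_loss(0.7, 1), logistic_loss(0.7, 1))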
Plank, Barbara and Moschitti, Alessandro
Related Work
Instance weighting is a method for domain adaptation in which instance-dependent weights are assigned to the loss function that is minimized during the training process.
Related Work
Let l(x, y, θ) be some loss function.
Related Work
Then, as shown in Jiang and Zhai (2007), the loss function can be weighted by β_i l(x, y, θ), such that β_i = P_t(x_i, y_i) / P_s(x_i, y_i), where P_s and P_t are the source and target distributions, respectively.
loss function is mentioned in 3 sentences in this paper.
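A short sketch of instance weighting as described above: each example's loss is scaled by β_i, an estimate of the target-to-source density ratio; the loss function and the β values here are placeholders.

    # Sketch: weighted empirical loss, each example scaled by beta_i, an estimate
    # of P_target(x_i, y_i) / P_source(x_i, y_i).
    def weighted_empirical_loss(examples, loss_fn, betas):
        """examples: (x, y) pairs; betas: one weight per example."""
        return sum(b * loss_fn(x, y) for (x, y), b in zip(examples, betas)) / len(examples)

    # Toy usage with a hypothetical squared-error loss on scalar data.
    data = [(1.0, 1.2), (2.0, 1.8)]
    print(weighted_empirical_loss(data, lambda x, y: (x - y) ** 2, [0.5, 2.0]))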
He, Zhengyan and Liu, Shujie and Li, Mu and Zhou, Ming and Zhang, Longkai and Wang, Houfeng
Learning Representation for Contextual Document
The loss function is defined as the negative log of the softmax function:
Learning Representation for Contextual Document
The loss function is closely related to contrastive estimation (Smith and Eisner, 2005), which defines where the positive example takes probability mass from.
Learning Representation for Contextual Document
In our experiments, the softmax loss function consistently outperforms the pairwise ranking loss function, so the softmax loss is taken as our default setting.
loss function is mentioned in 3 sentences in this paper.
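A minimal sketch of a negative-log-softmax loss in which a positive candidate competes with corrupted (negative) candidates, in the spirit of the contrastive setup described above; the score inputs are placeholders.

    # Sketch: -log softmax(positive) against the positive's own negatives.
    import numpy as np

    def neg_log_softmax_loss(pos_score, neg_scores):
        scores = np.concatenate(([pos_score], neg_scores))
        scores -= scores.max()                      # numerical stability
        log_z = np.log(np.exp(scores).sum())
        return -(scores[0] - log_z)                 # -log softmax(positive)

    print(neg_log_softmax_loss(2.0, np.array([0.5, 1.0, -0.3])))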
Björkelund, Anders and Kuhn, Jonas
Related Work
Durrett and Klein (2013) present a coreference resolver with latent antecedents that predicts clusterings over entire documents and fit a log-linear model with a custom task-specific loss function using AdaGrad (Duchi et al., 2011).
Representation and Learning
A loss function LOSS that quantifies the error in the prediction is used to compute a scalar τ that controls how much the weights are moved in each update.
Representation and Learning
The loss function can be an arbitrarily complex function that returns a numerical value of how bad the prediction is.
loss function is mentioned in 3 sentences in this paper.
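For intuition, a passive-aggressive-style sketch in which the loss value enters the step size τ; the exact update formula below is an assumption for illustration, not necessarily the one used in the paper.

    # Sketch (assumed formula): tau = max(0, score_pred - score_gold + LOSS)
    #                                 / ||f_gold - f_pred||^2
    import numpy as np

    def update(w, f_gold, f_pred, loss):
        delta = f_gold - f_pred
        denom = np.dot(delta, delta)
        if denom == 0.0:
            return w
        tau = max(0.0, np.dot(w, f_pred) - np.dot(w, f_gold) + loss) / denom
        return w + tau * delta

    w = np.zeros(3)
    print(update(w, np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0]), loss=2.0))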
Prettenhofer, Peter and Stein, Benno
Cross-Language Text Classification
L is a loss function that measures the quality of the classifier, λ is a nonnegative regularization parameter that penalizes model complexity, and ||w||² = w^T w.
Cross-Language Text Classification
Different choices for L entail different classifier types; e.g., when choosing the hinge loss function for L one obtains the popular Support Vector Machine classifier (Zhang, 2004).
Experiments
In particular, the learning rate schedule from PEGASOS is adopted (Shalev-Shwartz et al., 2007), and the modified Huber loss, introduced by Zhang (2004), is chosen as loss function L.3
loss function is mentioned in 3 sentences in this paper.
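The regularized objective quoted above has the form Σ_i L(w·x_i, y_i) + λ||w||²; below, a sketch with the hinge loss and the modified Huber loss (Zhang, 2004) as two choices for L.

    # Sketch: regularized empirical risk with two choices of loss L.
    def hinge(p, y):
        return max(0.0, 1.0 - p * y)

    def modified_huber(p, y):
        z = p * y
        if z >= -1.0:
            return max(0.0, 1.0 - z) ** 2
        return -4.0 * z

    def objective(w, X, Y, L, lam):
        dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
        return sum(L(dot(w, x), y) for x, y in zip(X, Y)) + lam * dot(w, w)

    print(objective([0.1, -0.2], [[1.0, 2.0], [0.5, -1.0]], [1, -1], modified_huber, 0.01))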
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Variational vs. Min-Risk Decoding
They use the following loss function, of which a linear approximation to BLEU (Papineni et al., 2001) is a special case,
Variational vs. Min-Risk Decoding
With the above loss function, Tromble et al.
Variational vs. Min-Risk Decoding
Footnote 15: The MBR becomes the MAP decision rule of (1) if a so-called zero-one loss function is used: l(y, y') = 0 if y = y'; otherwise l(y, y') = 1.
loss function is mentioned in 3 sentences in this paper.
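A small sketch of the MBR decision rule referenced above: choose the hypothesis with minimum expected loss under the model posterior; with the zero-one loss this reduces to the MAP rule, as the footnote notes. The toy hypotheses and posterior are illustrative.

    # Sketch: MBR decoding over a hypothesis list with a pluggable loss.
    def mbr_decode(hypotheses, posterior, loss):
        def expected_loss(y):
            return sum(p * loss(y, y_prime) for y_prime, p in zip(hypotheses, posterior))
        return min(hypotheses, key=expected_loss)

    zero_one = lambda y, y_prime: 0.0 if y == y_prime else 1.0
    print(mbr_decode(["a b", "a c", "b c"], [0.5, 0.3, 0.2], zero_one))  # "a b" (the MAP output)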
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Discussion
In this paper, we have described how MERT can be employed to estimate the weights for the linear loss function to maximize BLEU on a development set.
Experiments
Note that N-best MBR uses a sentence BLEU loss function.
Minimum Bayes-Risk Decoding
This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
loss function is mentioned in 3 sentences in this paper.