Learning Soft Linear Constraints with Application to Citation Field Extraction

Accurately segmenting a citation string into fields for authors, titles, etc.

Citation field extraction, an instance of information extraction, is the task of segmenting and labeling research paper citation strings into their constituent parts, including authors, editors, year, journal, volume, conference venue, etc.

2.1 Structured Linear Models

We now introduce an extension of Algorithm 1 to handle soft constraints.

We consider the UMass citation dataset, first introduced in Anzaroot and McCallum (2013).

Appears in 25 sentences as: soft constraint (2) Soft constraints (1) soft constraints (22)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Previous work has shown that modeling soft constraints , where the model is encouraged, but not require to obey the constraints, can substantially improve segmentation performance.Page 1, “Abstract”
- We extend dual decomposition to perform prediction subject to soft constraints .Page 1, “Abstract”
- Moreover, with a technique for performing inference given soft constraints , it is easy to automatically generate large families of constraints and learn their costs with a simple convex optimization problem during training.Page 1, “Abstract”
- On the other hand, recent work has demonstrated improvements in citation field extraction by imposing soft constraints (Chang et al., 2012).Page 1, “Introduction”
- This paper introduces a novel method for imposing soft constraints via dual decomposition.Page 2, “Introduction”
- We also propose a method for learning the penalties the prediction problem incurs for violating these soft constraints .Page 2, “Introduction”
- The overall modeling technique we employ is to add soft constraints to a simple model for which we have an existing efficient prediction algorithm.Page 2, “Background”
- We now introduce an extension of Algorithm 1 to handle soft constraints .Page 3, “Soft Constraints in Dual Decomposition”
- Note that when performing MAP subject to soft constraints , optimal solutions might not satisfy some constraints, since doing so would reduce the model’s score by too much.Page 3, “Soft Constraints in Dual Decomposition”
- In other words, the dual problem can not penalize the violation of a constraint more than the soft constraint model in the primal would penalize you if you violated it.Page 3, “Soft Constraints in Dual Decomposition”
- Algorithm 2 Soft-DD: projected subgradient for dual decomposition with soft constraintsPage 4, “Soft Constraints in Dual Decomposition”

See all papers in *Proc. ACL 2014* that mention soft constraints.

See all papers in *Proc. ACL* that mention soft constraints.

Back to top.

Appears in 19 sentences as: CRF (22)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- For this underlying model, we employ a chain-structured conditional random field ( CRF ), since CRFs have been shown to perform better than other simple unconstrained models like hidden markov models for citation extraction (Peng and McCallum, 2004).Page 2, “Background”
- The MAP inference task in a CRF be can expressed as an optimization problem with a lin-Page 2, “Background”
- Since the log probability of some 3/ in the CRF is proportional to sum of the scores of all the factors, we can concatenate the indicator variables as a vector y and the scores as a vector 21) and write the MAP problem asPage 2, “Background”
- Since we truncate penalties at 0, this suggests that we will learn a penalty of 0 for constraints in three categories: constraints that do not hold in the ground truth, constraints that hold in the ground truth but are satisfied in practice by performing inference in the base CRF model, and constraints that are satisfied in practice as a side-effect of imposing nonzero penalties on some other constraints .Page 4, “Soft Constraints in Dual Decomposition”
- Here, 3/], represents an output tag of the CRF , SO if = = 1, then we have that 3/], was given a label with index i.Page 5, “Citation Extraction Data”
- We constrain the output labeling of the chain-structured CRF to be a valid BIO encoding.Page 5, “Citation Extraction Data”
- Rather than enforcing these constraints using dual decomposition, they can be enforced directly when performing MAP inference in the CRF by modifying the dynamic program of the Viterbi algorithm to only allow valid pairs of adj a-cent labels.Page 5, “Citation Extraction Data”
- Here, prediction is done with respect to the base chain-structured CRF tagger and does not include global constraints.Page 6, “Citation Extraction Data”
- Here, yd denotes the ground truth labeling and wd is the vector of scores for the CRF tagger.Page 6, “Citation Extraction Data”
- While Gibbs sampling has been shown to work well tasks such as named entity recognition (Finkel et al., 2005), our previous experiments show that it does not work well for citation extraction, where it found only low-quality solutions in practice because the sampling did not mix well, even on a simple chain-structured CRF .Page 7, “Citation Extraction Data”
- Later, CRFs were shown to perform better on CORA, improving the results from the Hmm’s token-level F1 of 86.6 to 91.5 with a CRF (Peng and McCallum, 2004).Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention CRF.

See all papers in *Proc. ACL* that mention CRF.

Back to top.

Appears in 10 sentences as: ground truth (12)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Intuitively, the perceptron update increases the penalty for a constraint if it is satisfied in the ground truth and not in an inferred prediction, and decreases the penalty if the constraint is satisfied in the prediction and not the ground truth .Page 4, “Soft Constraints in Dual Decomposition”
- Since we truncate penalties at 0, this suggests that we will learn a penalty of 0 for constraints in three categories: constraints that do not hold in the ground truth, constraints that hold in the ground truth but are satisfied in practice by performing inference in the base CRF model, and constraints that are satisfied in practice as a side-effect of imposing nonzero penalties on some other constraints .Page 4, “Soft Constraints in Dual Decomposition”
- Our score compares how often the constraint is Violated in the ground truth examples versus our predictions.Page 6, “Citation Extraction Data”
- Note that it may make sense to consider a constraint that is sometimes Violated in the ground truth , as the penalty learning algorithm can learn a small penalty for it, whichPage 6, “Citation Extraction Data”
- Here, yd denotes the ground truth labeling and wd is the vector of scores for the CRF tagger.Page 6, “Citation Extraction Data”
- A value of imp(c) above 1 implies that the constraint is more violated on the predicted examples than on the ground truth , and hence that we might want to keep it.Page 6, “Citation Extraction Data”
- The importance score of a constraint provides information about how often it is violated by the CRF, but holds in the ground truth , and a nonzero penalty implies we enforce it as a soft constraint at test time.Page 8, “Citation Extraction Data”
- The above example constraints are almost always satisfied on the ground truth , and would be useful to enforce as hard constraints.Page 9, “Citation Extraction Data”
- However, there are a number of learned constraints that are often violated on the ground truth but are still useful as soft constraints.Page 9, “Citation Extraction Data”
- These constraints are moderately violated on ground truth examples, however.Page 9, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention ground truth.

See all papers in *Proc. ACL* that mention ground truth.

Back to top.

Appears in 6 sentences as: optimization problem (5) optimization problems (1)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Moreover, with a technique for performing inference given soft constraints, it is easy to automatically generate large families of constraints and learn their costs with a simple convex optimization problem during training.Page 1, “Abstract”
- The MAP inference task in a CRF be can expressed as an optimization problem with a lin-Page 2, “Background”
- Furthermore, a subgradient of D(A) is Ay* — b, for an y* which maximizes this inner optimization problem .Page 3, “Background”
- Consider the optimization problems of the form: max.Page 3, “Soft Constraints in Dual Decomposition”
- This optimization problem can still be solved with projected subgradient descent and is depicted in Algorithm 2.Page 3, “Soft Constraints in Dual Decomposition”
- Each penalty Ci has to be nonnegative; otherwise, the optimization problem in equation (5) is ill-defined.Page 4, “Soft Constraints in Dual Decomposition”

See all papers in *Proc. ACL 2014* that mention optimization problem.

See all papers in *Proc. ACL* that mention optimization problem.

Back to top.

Appears in 5 sentences as: CRFs (5)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Hidden Markov models and linear-chain conditional random fields ( CRFs ) have previously been applied to citation extraction (Hetzner, 2008; Peng and McCallum, 2004) .Page 1, “Introduction”
- For this underlying model, we employ a chain-structured conditional random field (CRF), since CRFs have been shown to perform better than other simple unconstrained models like hidden markov models for citation extraction (Peng and McCallum, 2004).Page 2, “Background”
- The algorithms we present in later sections for handling soft global constraints and for learning the penalties of these constraints can be applied to general structured linear models, not just CRFs , provided we have an available algorithm for performing MAP inference.Page 2, “Background”
- Later, CRFs were shown to perform better on CORA, improving the results from the Hmm’s token-level F1 of 86.6 to 91.5 with a CRF(Peng and McCallum, 2004).Page 7, “Citation Extraction Data”
- This approach is limited in its use of an HMM as an underlying model, as it has been shown that CRFs perform significantly better, achieving 95.37 token-level accuracy on CORA (Peng and McCallum, 2004).Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention CRFs.

See all papers in *Proc. ACL* that mention CRFs.

Back to top.

Appears in 4 sentences as: development set (4)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- We found it beneficial, though it is not theoretically necessary, to learn the constraints on a held-out development set , separately from the other model parameters, as during training most constraints are satisfied due to overfitting, which leads to an underestimation of the relevant penalties.Page 4, “Soft Constraints in Dual Decomposition”
- There are 660 citations in the development set and 367 citation in the test set.Page 4, “Citation Extraction Data”
- We then use the development set to learn the penalties for the soft constraints, using the perceptron algorithm described in section 3.1.Page 7, “Citation Extraction Data”
- We instantiate constraints from each template in section 5.1, iterating over all possible labels that contain a B prefix at any level in the hierarchy and pruning all constraints with imp(c) < 2.75 calculated on the development set .Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention development set.

See all papers in *Proc. ACL* that mention development set.

Back to top.

Appears in 4 sentences as: F1 score (3) F1 scores (1)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Table 1: Set of constraints learned and F1 scores .Page 6, “Citation Extraction Data”
- This final feature improves the F1 score on the cleaned test set from 94.0 F1 to 94.44 F1, which we use as a baseline score.Page 7, “Citation Extraction Data”
- We asses performance in terms of field-level F1 score , which is the harmonic mean of precision and recall for predicted segments.Page 7, “Citation Extraction Data”
- Table 1 shows how each type of constraint family improved the F1 score on the dataset.Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention F1 score.

See all papers in *Proc. ACL* that mention F1 score.

Back to top.

Appears in 4 sentences as: perceptron (4)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- All we need to employ the structured perceptron algorithm (Collins, 2002) or the structured SVM algorithm (Tsochantaridis et al., 2004) is a black-box procedure for performing MAP inference in the structured linear model given an arbitrary cost vector.Page 4, “Soft Constraints in Dual Decomposition”
- This can be ensured by simple modifications of the perceptron and subgradient descent optimization of the structured SVM objective simply by truncating c coordinate-wise to be nonnegative at every learning iteration.Page 4, “Soft Constraints in Dual Decomposition”
- Intuitively, the perceptron update increases the penalty for a constraint if it is satisfied in the ground truth and not in an inferred prediction, and decreases the penalty if the constraint is satisfied in the prediction and not the ground truth.Page 4, “Soft Constraints in Dual Decomposition”
- We then use the development set to learn the penalties for the soft constraints, using the perceptron algorithm described in section 3.1.Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention perceptron.

See all papers in *Proc. ACL* that mention perceptron.

Back to top.

Appears in 3 sentences as: graphical model (3)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- Here, we define a binary indicator variable for each candidate setting of each factor in the graphical model .Page 2, “Background”
- There are multiple previous examples of augmenting chain-structured sequence models with terms capturing global relationships by expanding the chain to a more complex graphical model with nonlocal dependencies between the outputs.Page 6, “Citation Extraction Data”
- Soft constraints can be implemented inefficiently using hard constraints and dual decomposition— by introducing copies of output variables and an auxiliary graphical model , as in Rush et al.Page 7, “Citation Extraction Data”

See all papers in *Proc. ACL 2014* that mention graphical model.

See all papers in *Proc. ACL* that mention graphical model.

Back to top.

Appears in 3 sentences as: SVM (3)

In *Learning Soft Linear Constraints with Application to Citation Field Extraction*

- All we need to employ the structured perceptron algorithm (Collins, 2002) or the structured SVM algorithm (Tsochantaridis et al., 2004) is a black-box procedure for performing MAP inference in the structured linear model given an arbitrary cost vector.Page 4, “Soft Constraints in Dual Decomposition”
- This can be ensured by simple modifications of the perceptron and subgradient descent optimization of the structured SVM objective simply by truncating c coordinate-wise to be nonnegative at every learning iteration.Page 4, “Soft Constraints in Dual Decomposition”
- A similar analysis holds for the structured SVM ap-Page 4, “Soft Constraints in Dual Decomposition”

See all papers in *Proc. ACL 2014* that mention SVM.

See all papers in *Proc. ACL* that mention SVM.

Back to top.