Approximation Strategies for Multi-Structure Sentence Compression

Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming.

Sentence compression is a text-to-text generation task in which an input sentence must be transformed into a shorter output sentence which accurately reflects the meaning in the input and also remains grammatically well-formed.

Even though compression is typically formulated as a token deletion task, it is evident that dropping tokens independently from an input sentence will likely not result in fluent and meaningful compressive text.

We ran compression experiments over the newswire (NW) and broadcast news transcription (BN) corpora compiled by Clarke and Lapata (2008) which contain gold compressions produced by human annotators using only word deletion.

Sentence compression is one of the better-studied text-to-text generation problems and has been observed to play a significant role in human summarization (Jing, 2000; Jing and McKeown, 2000).

We have presented approximate inference strategies to jointly compress sentences under bigram and dependency-factored objectives by exploiting the modularity of the task and considering the two subproblems in isolation.

Appears in 19 sentences as: ILP (17) ILPs (3)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Joint methods have also been proposed that invoke integer linear programming ( ILP ) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies (Martins and Smith, 2009) or n-grams and all possible dependencies (Thadani and McKeown, 2013).Page 1, “Introduction”
- However, it is well-established that the utility of ILP for optimal inference in structured problems is often outweighed by the worst-case performance of ILP solvers on large problems without unique integral solutions.Page 1, “Introduction”
- In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions.Page 2, “Introduction”
- Our proposed approximation strategies are evaluated using automated metrics in order to address the question: under what conditions should a real-world sentence compression system implementation consider exact inference with an ILP or approximate inference?Page 2, “Introduction”
- The primary advantage of this technique is the ability to leverage the underlying structure of the problems in inference rather than relying on a generic ILP formulation while still often producing exact solutions.Page 3, “Multi-Structure Sentence Compression”
- Even if ILP-based approaches perform reasonably at the scale of single-sentence compression problems, the exponential worst-case complexity of general-purpose ILPs will inevitably pose challenges when scaling up to (a) handle larger inputs, (b) use higher-order structural fragments, or (c) incorporate additional models.Page 3, “Multi-Structure Sentence Compression”
- In order to produce a solution to this subproblem, we use an LP relaxation of the relevant portion of the ILP from Thadani and McKeown (2013) by omitting integer constraints over the token and dependency variables in x and 2 respectively.Page 5, “Multi-Structure Sentence Compression”
- For simplicity, however, we describe the ILP version rather than the relaxed LP in order to motivate the constraints with their intended purpose rather than their effect in the relaxed prob-Page 5, “Multi-Structure Sentence Compression”
- Figure 2: An illustration of commodity values for a valid solution of the non-relaxed ILP .Page 5, “Multi-Structure Sentence Compression”
- These constraints ensure that cyclic structures are not possible in the non-relaxed ILP .Page 5, “Multi-Structure Sentence Compression”
- Figure 2 illustrates how these commodity flow variables constrain the output of the ILP to be a tree.Page 5, “Multi-Structure Sentence Compression”

See all papers in *Proc. ACL 2014* that mention ILP.

See all papers in *Proc. ACL* that mention ILP.

Back to top.

Appears in 13 sentences as: Bigram (1) bigram (8) bigrams (6)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies.Page 2, “Multi-Structure Sentence Compression”
- C. In addition, we define bigram indicator variables yij E {0, l} to represent whether a particular order-preserving bigram2 (ti, tj> from S is present as a contiguous bigram in C as well as dependency indicator variables zij E {0, 1} corresponding to whether the dependency arc ti —> 253- is present in the dependency parse of C. The score for a given compression 0 can now be defined to factor over its tokens, n-grams and dependencies as follows.Page 3, “Multi-Structure Sentence Compression”
- where Qtok, ngr and 6dep are feature-based scoring functions for tokens, bigrams and dependencies respectively.Page 3, “Multi-Structure Sentence Compression”
- where the incidence vector x é (MEET represents an entire token configuration over T, with y and 2 defined analogously to represent configurations of bigrams and dependencies.Page 3, “Multi-Structure Sentence Compression”
- The structural configurations y and z are non-degenerate, i.e, the bigram configuration y represents an acyclic path while the dependency configuration 2 forms a tree.Page 3, “Multi-Structure Sentence Compression”
- 2Although Thadani and McKeown (2013) is not restricted to bigrams or order-preserving n- grams, we limit our discussion to this scenario as it also fits the assumptions of McDonald (2006) and the datasets of Clarke and Lapata (2006).Page 3, “Multi-Structure Sentence Compression”
- Consider once more the optimization problem characterized by (2) The two structural problems that need to be solved in this formulation are the extraction of a maximum-weight acyclic subsequence of bigrams y from the lattice of all order-preserving bigrams from S and the recovery of a maximum-weight directed subtree 2.Page 4, “Multi-Structure Sentence Compression”
- 2.3 Bigram subsequencesPage 4, “Multi-Structure Sentence Compression”
- McDonald (2006) provides a Viterbi-like dynamic programming algorithm to recover the highest-scoring sequence of order-preserving bigrams from a lattice, either in unconstrained form or with a specific length constraint.Page 4, “Multi-Structure Sentence Compression”
- o The bigram subproblem is guaranteed to return a well-formed integral solution which obeys the imposed compression rate, so we are assured of a source of valid—if non-optimal—solutions in line 13 of Algorithm 1.Page 6, “Multi-Structure Sentence Compression”
- It is straightforward to recover and score a bigram configuration y from a token configuration ,6 However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph containing across only the tokens that are active in a(y) and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.Page 6, “Multi-Structure Sentence Compression”

See all papers in *Proc. ACL 2014* that mention bigram.

See all papers in *Proc. ACL* that mention bigram.

Back to top.

Appears in 9 sentences as: spanning tree (9) spanning trees (1)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- We recover approximate solutions to this problem using LP relaxation and maximum spanning tree algorithms, yielding techniques that can be combined with the efficient bigram-based inference approach using Lagrange multipliers.Page 1, “Abstract”
- Maximum spanning tree algorithms, commonly used in non-projective dependency parsing (McDonald et al., 2005), are not easily adaptable to this task since the maximum-weight subtree is not necessarily a part of the maximum spanning tree .Page 2, “Introduction”
- In addition, we can recover approximate solutions to this problem by using the Chu-Liu Edmonds algorithm for recovering maximum spanning trees (Chu and Liu, 1965; Edmonds, 1967) over the relatively sparse subgraph defined by a solution to the relaxed LP.Page 2, “Introduction”
- Figure 1: An example of the difficulty of recovering the maximum-weight subtree (B—>C, B—>D) from the maximum spanning tree (A—>C, C—>B, B—>D).Page 5, “Multi-Structure Sentence Compression”
- Although the maximum spanning tree for a given token configuration can be recovered efficiently, Figure 1 illustrates that the maximum-scoring subtree is not necessarily found within it.Page 5, “Multi-Structure Sentence Compression”
- Since the commodity flow constraints in (9)—(ll) ensure a connected i, it is therefore possible to recover a maximum-weight spanning tree from G using the Chu-Liu Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967).5 Although the runtime of this algorithm is cubic in the size of the input graph, it is fairly speedy when applied on relatively sparse graphs such as the solutions to the LP described above.Page 6, “Multi-Structure Sentence Compression”
- The resulting spanning tree is a useful integral approximation of i but, as mentioned previously, may contain more nodes than R due to fractional values in i; we therefore repeatedly prune leavesPage 6, “Multi-Structure Sentence Compression”
- It is straightforward to recover and score a bigram configuration y from a token configuration ,6 However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph containing across only the tokens that are active in a(y) and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.Page 6, “Multi-Structure Sentence Compression”
- As discussed in §2.4, a maximum spanning tree is recovered from the output of the LP and greedily pruned in order to generate a valid integral solution while observing the imposed compression rate.Page 7, “Experiments”

See all papers in *Proc. ACL 2014* that mention spanning tree.

See all papers in *Proc. ACL* that mention spanning tree.

Back to top.

Appears in 8 sentences as: Sentence compression (3) sentence compression (5)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming.Page 1, “Abstract”
- While dynamic programming is viable for bigram-based sentence compression , finding optimal compressed trees within graphs is NP-hard.Page 1, “Abstract”
- Sentence compression is a text-to-text generation task in which an input sentence must be transformed into a shorter output sentence which accurately reflects the meaning in the input and also remains grammatically well-formed.Page 1, “Introduction”
- Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering.1 A number of diverse approaches have been proposed for deletion-based sentence compression , including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).Page 1, “Introduction”
- Our proposed approximation strategies are evaluated using automated metrics in order to address the question: under what conditions should a real-world sentence compression system implementation consider exact inference with an ILP or approximate inference?Page 2, “Introduction”
- Following evaluations in machine translation as well as previous work in sentence compression (Unno et al., 2006; Clarke and Lapata, 2008; Martins and Smith, 2009; Napoles et al., 2011b; Thadani and McKeown, 2013), we evaluate system performance using F1 metrics over n-grams and dependency edges produced by parsing system output with RASP (Briscoe et al., 2006) and the Stanford parser.Page 7, “Experiments”
- Sentence compression is one of the better-studied text-to-text generation problems and has been observed to play a significant role in human summarization (Jing, 2000; Jing and McKeown, 2000).Page 9, “Related Work”
- Most approaches to sentence compression are supervised (Knight and Marcu, 2002; Riezler et al., 2003; Turner and Charniak, 2005; McDonald, 2006; Unno et al., 2006; Galley and McKeown, 2007; Nomoto, 2007; Cohn and Lapata, 2009; Galanis and Androutsopoulos, 2010; Gan-itkevitch et al., 2011; Napoles et al., 2011a; Filippova and Altun, 2013) following the release of datasets such as the Ziff-Davis corpus (Knight and Marcu, 2000) and the Edinburgh compression corpora (Clarke and Lapata, 2006; Clarke and Lapata, 2008), although unsupervised approaches—largely based on ILPs—have also received consideration (Clarke and Lapata, 2007; Clarke and Lapata, 2008; Filippova and Strube, 2008).Page 9, “Related Work”

See all papers in *Proc. ACL 2014* that mention sentence compression.

See all papers in *Proc. ACL* that mention sentence compression.

Back to top.

Appears in 7 sentences as: dynamic program (3) dynamic programming (4)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- While dynamic programming is viable for bigram-based sentence compression, finding optimal compressed trees within graphs is NP-hard.Page 1, “Abstract”
- However, while the former problem can be solved efficiently using the dynamic programming approach of McDonald (2006), there are no efficient algorithms to recover maximum weighted non-projective subtrees in a general directed graph.Page 2, “Introduction”
- Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies.Page 2, “Multi-Structure Sentence Compression”
- McDonald (2006) provides a Viterbi-like dynamic programming algorithm to recover the highest-scoring sequence of order-preserving bigrams from a lattice, either in unconstrained form or with a specific length constraint.Page 4, “Multi-Structure Sentence Compression”
- The latter requires a dynamic programming table [T] which represents the best score for a compression of length 7“ ending at token i.Page 4, “Multi-Structure Sentence Compression”
- It is straightforward to recover and score a bigram configuration y from a token configuration ,6 However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph containing across only the tokens that are active in a(y) and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.Page 6, “Multi-Structure Sentence Compression”
- 0 DP: The bigram-based dynamic program of McDonald (2006) described in §2.3.9Page 7, “Experiments”

See all papers in *Proc. ACL 2014* that mention dynamic programming.

See all papers in *Proc. ACL* that mention dynamic programming.

Back to top.

Appears in 7 sentences as: Maximum spanning (1) maximum spanning (7)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- We recover approximate solutions to this problem using LP relaxation and maximum spanning tree algorithms, yielding techniques that can be combined with the efficient bigram-based inference approach using Lagrange multipliers.Page 1, “Abstract”
- Maximum spanning tree algorithms, commonly used in non-projective dependency parsing (McDonald et al., 2005), are not easily adaptable to this task since the maximum-weight subtree is not necessarily a part of the maximum spanning tree.Page 2, “Introduction”
- In addition, we can recover approximate solutions to this problem by using the Chu-Liu Edmonds algorithm for recovering maximum spanning trees (Chu and Liu, 1965; Edmonds, 1967) over the relatively sparse subgraph defined by a solution to the relaxed LP.Page 2, “Introduction”
- Figure 1: An example of the difficulty of recovering the maximum-weight subtree (B—>C, B—>D) from the maximum spanning tree (A—>C, C—>B, B—>D).Page 5, “Multi-Structure Sentence Compression”
- Although the maximum spanning tree for a given token configuration can be recovered efficiently, Figure 1 illustrates that the maximum-scoring subtree is not necessarily found within it.Page 5, “Multi-Structure Sentence Compression”
- It is straightforward to recover and score a bigram configuration y from a token configuration ,6 However, scoring solutions produced by the dynamic program from §2.3 also requires the score over a corresponding parse tree; this can be recovered by constructing a dependency subgraph containing across only the tokens that are active in a(y) and retrieving the maximum spanning tree for that subgraph using the Chu-Liu Edmonds algorithm.Page 6, “Multi-Structure Sentence Compression”
- As discussed in §2.4, a maximum spanning tree is recovered from the output of the LP and greedily pruned in order to generate a valid integral solution while observing the imposed compression rate.Page 7, “Experiments”

See all papers in *Proc. ACL 2014* that mention maximum spanning.

See all papers in *Proc. ACL* that mention maximum spanning.

Back to top.

Appears in 7 sentences as: n-gram (7)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming.Page 1, “Abstract”
- Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering.1 A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).Page 1, “Introduction”
- In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions.Page 2, “Introduction”
- Let a(y) E {0, 1}” denote the incidence vector of tokens contained in the n-gram sequence y and ,6(z) E {0, 1}” denote the incidence vector of words contained in the dependency tree 2.Page 4, “Multi-Structure Sentence Compression”
- 0 ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.Page 7, “Experiments”
- Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n- gram factorization, which is also responsible for a significant increase in wall-clock time.Page 8, “Experiments”
- We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures.Page 8, “Experiments”

See all papers in *Proc. ACL 2014* that mention n-gram.

See all papers in *Proc. ACL* that mention n-gram.

Back to top.

Appears in 5 sentences as: linear program (2) linear programming (3)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Experiments show that these approximation strategies produce results comparable to a state-of-the-art integer linear programming formulation for the same joint inference task along with a significant improvement in runtime.Page 1, “Abstract”
- Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies (Martins and Smith, 2009) or n-grams and all possible dependencies (Thadani and McKeown, 2013).Page 1, “Introduction”
- We therefore consider methods to recover approximate solutions for the subproblem of finding the maximum weighted subtree in a graph, common among which is the use of a linear programming relaxation.Page 2, “Introduction”
- This linear program (LP) appears empirically tight for compression problems and our experiments indicate that simply using the non-integral solutions of this LP in Lagrangian relaxation can empirically lead to reasonable compressions.Page 2, “Introduction”
- Experiments show that one of these approximation strategies produces results comparable to a state-of-the-art integer linear program for the same joint inference task with a 60% reduction in average inference time.Page 9, “Conclusion”

See all papers in *Proc. ACL 2014* that mention linear programming.

See all papers in *Proc. ACL* that mention linear programming.

Back to top.

Appears in 4 sentences as: dependency parse (1) dependency parses (2) dependency parsing (1)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering.1 A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).Page 1, “Introduction”
- Maximum spanning tree algorithms, commonly used in non-projective dependency parsing (McDonald et al., 2005), are not easily adaptable to this task since the maximum-weight subtree is not necessarily a part of the maximum spanning tree.Page 2, “Introduction”
- C. In addition, we define bigram indicator variables yij E {0, l} to represent whether a particular order-preserving bigram2 (ti, tj> from S is present as a contiguous bigram in C as well as dependency indicator variables zij E {0, 1} corresponding to whether the dependency arc ti —> 253- is present in the dependency parse of C. The score for a given compression 0 can now be defined to factor over its tokens, n-grams and dependencies as follows.Page 3, “Multi-Structure Sentence Compression”
- Gold dependency parses were approximated by running the Stanford dependency parser7 over reference compressions.Page 7, “Experiments”

See all papers in *Proc. ACL 2014* that mention dependency parses.

See all papers in *Proc. ACL* that mention dependency parses.

Back to top.

Appears in 4 sentences as: n-grams (5)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies (Martins and Smith, 2009) or n-grams and all possible dependencies (Thadani and McKeown, 2013).Page 1, “Introduction”
- C. In addition, we define bigram indicator variables yij E {0, l} to represent whether a particular order-preserving bigram2 (ti, tj> from S is present as a contiguous bigram in C as well as dependency indicator variables zij E {0, 1} corresponding to whether the dependency arc ti —> 253- is present in the dependency parse of C. The score for a given compression 0 can now be defined to factor over its tokens, n-grams and dependencies as follows.Page 3, “Multi-Structure Sentence Compression”
- Following evaluations in machine translation as well as previous work in sentence compression (Unno et al., 2006; Clarke and Lapata, 2008; Martins and Smith, 2009; Napoles et al., 2011b; Thadani and McKeown, 2013), we evaluate system performance using F1 metrics over n-grams and dependency edges produced by parsing system output with RASP (Briscoe et al., 2006) and the Stanford parser.Page 7, “Experiments”
- We report results over the following systems grouped into three categories of models: tokens + n-grams , tokens + dependencies, and joint models.Page 7, “Experiments”

See all papers in *Proc. ACL 2014* that mention n-grams.

See all papers in *Proc. ACL* that mention n-grams.

Back to top.

Appears in 3 sentences as: dependency tree (2) dependency trees (1)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- Let a(y) E {0, 1}” denote the incidence vector of tokens contained in the n-gram sequence y and ,6(z) E {0, 1}” denote the incidence vector of words contained in the dependency tree 2.Page 4, “Multi-Structure Sentence Compression”
- Linear constraints are introduced to produce dependency structures that are close to the optimal dependency trees .Page 5, “Multi-Structure Sentence Compression”
- In order to avoid cycles in the dependency tree , we include additional variables to establish single-commodity flow (Magnanti and Wolsey, 1994) between all pairs of tokens.Page 5, “Multi-Structure Sentence Compression”

See all papers in *Proc. ACL 2014* that mention dependency tree.

See all papers in *Proc. ACL* that mention dependency tree.

Back to top.

Appears in 3 sentences as: subtrees (3)

In *Approximation Strategies for Multi-Structure Sentence Compression*

- However, while the former problem can be solved efficiently using the dynamic programming approach of McDonald (2006), there are no efficient algorithms to recover maximum weighted non-projective subtrees in a general directed graph.Page 2, “Introduction”
- 2.4 Dependency subtreesPage 5, “Multi-Structure Sentence Compression”
- In addition, to avoid producing multiple disconnected subtrees , only one dependency is permitted to attach to the ROOT pseudo-token.Page 5, “Multi-Structure Sentence Compression”

See all papers in *Proc. ACL 2014* that mention subtrees.

See all papers in *Proc. ACL* that mention subtrees.

Back to top.