Jointly Learning to Extract and Compress
Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein

Article Structure

Abstract

We learn a joint model of sentence extraction and compression for multi-document summarization.

Introduction

Applications of machine learning to automatic summarization have met with limited success, and, as a result, many top-performing systems remain largely ad-hoc.

Joint Model

We focus on the task of multi-document summarization.

Structured Learning

Discriminative training attempts to minimize the loss incurred during prediction by optimizing an objective on the training set.

Efficient Prediction

We show how to perform prediction with the extractive and compressive models by solving ILPs.

Data

We use the data from the Text Analysis Conference (TAC) evaluations from 2008 and 2009, a total of 92 multi-document summarization problems.

Features

Here we describe the features used to parameterize our model.

Experiments

7.1 Experimental setup

Conclusion

Jointly learning to extract and compress within a unified model outperforms learning pure extraction, which in turn outperforms a state-of-the-art extractive baseline.

Topics

bigram

Appears in 36 sentences as: Bigram (3) bigram (24) bigrams (17)
In Jointly Learning to Extract and Compress
  1. While past extractive methods have assigned value to individual sentences and then explicitly represented the notion of redundancy (Carbonell and Goldstein, 1998), recent methods show greater success by using a simpler notion of coverage: bigrams
    Page 2, “Joint Model”
  2. Note that there is intentionally a bigram missing from (a).
    Page 2, “Joint Model”
  3. contribute content, and redundancy is implicitly encoded in the fact that redundant sentences cover fewer bigrams (Nenkova and Vanderwende, 2005; Gillick and Favre, 2009).
    Page 2, “Joint Model”
  4. Here, vb is the value of bigram b, and B(y) is the set of bigrams present in the summary encoded by y. Gillick and Favre (2009) produced a state-of-the-art system by directly optimizing this objective.
    Page 2, “Joint Model”
  5. They let the value vb of each bigram be given by the number of input documents the bigram appears in.
    Page 2, “Joint Model”
  6. We extend objective 1 so that it assigns value not just to the bigrams that appear in the summary, but also to the choices made in the creation of the summary.
    Page 2, “Joint Model”
  7. This entails parameterizing each bigram score vb and each subtree deletion score vc.
    Page 3, “Joint Model”
  8. We use bigram recall as our loss function (see Section 3.3).
    Page 4, “Structured Learning”
  9. Luckily, our choice of loss function, bigram recall, factors over bigrams.
    Page 4, “Structured Learning”
  10. We simply modify each bigram value vb to include bigram b's contribution to the total loss.
    Page 4, “Structured Learning”
  11. We verified that bigram recall correlates well with ROUGE and with manual metrics.
    Page 4, “Structured Learning”
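
The snippets above describe the coverage objective: a summary scores the sum of the values vb of the distinct bigrams it contains, with vb set to the number of input documents containing the bigram (items 4 and 5), so redundant sentences add little new value. The following is a minimal sketch of that scoring scheme, not the authors' code; the helper names are illustrative.

    from collections import Counter
    from itertools import chain

    def bigrams(tokens):
        """Set of adjacent-word bigrams in a token sequence."""
        return set(zip(tokens, tokens[1:]))

    def bigram_values(documents):
        """vb = number of input documents that contain bigram b, the weighting
        attributed above to Gillick and Favre (2009)."""
        df = Counter()
        for doc_sentences in documents:          # each document: a list of token lists
            df.update(set(chain.from_iterable(bigrams(s) for s in doc_sentences)))
        return df

    def coverage_score(summary_sentences, values):
        """Objective 1: total value of the distinct bigrams covered by the summary.
        Redundancy is penalized implicitly: repeated content adds few new bigrams."""
        covered = set(chain.from_iterable(bigrams(s) for s in summary_sentences))
        return sum(values[b] for b in covered)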

ILP

Appears in 18 sentences as: ILP (17) ILPs (4)
In Jointly Learning to Extract and Compress
  1. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance.
    Page 1, “Abstract”
  2. Inference in our model can be cast as an integer linear program (ILP) and solved in reasonable time using a generic ILP solver; we also introduce a fast approximation scheme which achieves similar performance.
    Page 2, “Introduction”
  3. We show how to perform prediction with the extractive and compressive models by solving ILPs.
    Page 5, “Efficient Prediction”
  4. For many instances, a generic ILP solver can find exact solutions to the prediction problems in a matter of seconds.
    Page 5, “Efficient Prediction”
  5. 4.1 ILP for extraction
    Page 5, “Efficient Prediction”
  6. Gillick and Favre (2009) express the optimization of Objective 1 for extractive summarization as an ILP.
    Page 5, “Efficient Prediction”
  7. They specify the following ILP over binary variables y and z:
    Page 5, “Efficient Prediction”
  8. Solving the ILP is fast in practice.
    Page 5, “Efficient Prediction”
  9. 4.2 ILP for joint compression and extraction
    Page 5, “Efficient Prediction”
  10. We can extend the ILP formulation of extraction to solve the compressive problem.
    Page 5, “Efficient Prediction”
  11. Figure 2: Diagram of ILP for joint extraction and compression.
    Page 5, “Efficient Prediction”
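
Items 6 and 7 refer to the extraction ILP of Gillick and Favre (2009): maximize the total value of covered bigrams subject to a length budget, with consistency constraints tying sentence indicators y to bigram indicators z. Below is a sketch of that standard formulation using the off-the-shelf PuLP solver; it illustrates the setup rather than reproducing the paper's exact program, and the bookkeeping is mine.

    import pulp

    def extract_summary(sentences, values, budget):
        """Gillick-and-Favre-style extraction ILP (sketch).
        sentences: list of (length, bigram_set); values: dict bigram -> vb."""
        all_bigrams = set().union(*(bgs for _, bgs in sentences))
        prob = pulp.LpProblem("extractive_summarization", pulp.LpMaximize)
        y = [pulp.LpVariable(f"y_{i}", cat="Binary") for i in range(len(sentences))]
        z = {b: pulp.LpVariable(f"z_{j}", cat="Binary") for j, b in enumerate(all_bigrams)}

        # Objective 1: total value of the bigrams covered by the summary.
        prob += pulp.lpSum(values[b] * z[b] for b in all_bigrams)
        # Length budget over the selected sentences.
        prob += pulp.lpSum(length * y[i] for i, (length, _) in enumerate(sentences)) <= budget
        # Selecting a sentence covers all of its bigrams ...
        for i, (_, bgs) in enumerate(sentences):
            for b in bgs:
                prob += y[i] <= z[b]
        # ... and a bigram is covered only if some selected sentence contains it.
        for b in all_bigrams:
            prob += z[b] <= pulp.lpSum(y[i] for i, (_, bgs) in enumerate(sentences) if b in bgs)

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [i for i in range(len(sentences)) if pulp.value(y[i]) > 0.5]

Item 10 notes that the same formulation can be extended to the joint compressive problem, and items 1 and 2 mention a fast approximation scheme for cases where the exact solver is too slow.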

extractive systems

Appears in 8 sentences as: EXTRACTIVE system (1) extractive system (1) extractive systems (6)
In Jointly Learning to Extract and Compress
  1. For example, Zajic et al. (2006) use a pipeline approach, preprocessing to yield additional candidates for extraction by applying heuristic sentence compressions, but their system does not outperform state-of-the-art purely extractive systems.
    Page 1, “Introduction”
  2. A second contribution of the current work is to show a system for jointly learning to jointly compress and extract that exhibits gains in both ROUGE and content metrics over purely extractive systems.
    Page 1, “Introduction”
  3. learns parameters for compression and extraction jointly using an approximate training procedure, but his results are not competitive with state-of-the-art extractive systems, and he does not report improvements on manual content or quality metrics.
    Page 2, “Introduction”
  4. Learning weights for Objective 1 where Y(x) is the set of extractive summaries gives our LEARNED EXTRACTIVE system.
    Page 3, “Joint Model”
  5. To train the extractive system described in Section 2, we use as our labels y* the extractions with the largest bigram recall values relative to the sets of references.
    Page 6, “Data”
  6. But, importantly, the gains achieved by the joint extractive and compressive system in content-based metrics do not come at the cost of linguistic quality when compared to purely extractive systems.
    Page 8, “Experiments”
  7. The joint extractive and compressive system fits more word types into a summary than the extractive systems, but also produces longer sentences on average.
    Page 8, “Experiments”
  8. Reading the output summaries more carefully suggests that by learning to extract and compress jointly, our joint system has the flexibility to use or create reasonable, medium-length sentences, whereas the extractive systems are stuck with a few valuable long sentences, but several less productive shorter sentences.
    Page 8, “Experiments”

loss function

Appears in 5 sentences as: Loss function (1) loss function (4)
In Jointly Learning to Extract and Compress
  1. We will perform discriminative training using a loss function that directly measures end-to-end summarization quality.
    Page 3, “Structured Learning”
  2. We use bigram recall as our loss function (see Section 3.3).
    Page 4, “Structured Learning”
  3. Luckily, our choice of loss function, bigram recall, factors over bigrams.
    Page 4, “Structured Learning”
  4. 3.3 Loss function
    Page 4, “Structured Learning”
  5. With an extractive reference denoted by y*, our loss function is:
    Page 4, “Structured Learning”
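
Item 5's formula is truncated in the snippet, but items 1 through 3 identify the loss as bigram recall measured against an extractive reference y*. A plausible reading, sketched below, is loss = 1 minus the fraction of reference bigrams recovered; this reconstruction is mine, not quoted from the paper.

    def bigram_recall(summary_bigrams, reference_bigrams):
        """Fraction of the reference's bigrams that the summary recovers."""
        return len(summary_bigrams & reference_bigrams) / len(reference_bigrams)

    def bigram_recall_loss(summary_bigrams, reference_bigrams):
        """Loss under the reading above: 1 - bigram recall.  It factors over bigrams:
        each reference bigram the summary covers removes 1/|B(y*)| from the loss."""
        return 1.0 - bigram_recall(summary_bigrams, reference_bigrams)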

constituency parses

Appears in 4 sentences as: constituency parse (1) Constituency parses (1) constituency parses (2)
In Jointly Learning to Extract and Compress
  1. In our complete model, which jointly extracts and compresses sentences, we choose whether or not to cut individual subtrees in the constituency parses
    Page 2, “Joint Model”
  2. Assume a constituency parse ts for every sentence s.
    Page 3, “Joint Model”
  3. While we use constituency parses rather than dependency parses, this model has similarities with the vine-growth model of Daumé III (2006).
    Page 3, “Joint Model”
  4. Constituency parses were produced using the Berkeley parser (Petrov and Klein, 2007).
    Page 8, “Experiments”

joint model

Appears in 4 sentences as: joint model (3) joint model: (1)
In Jointly Learning to Extract and Compress
  1. We learn a joint model of sentence extraction and compression for multi-document summarization.
    Page 1, “Abstract”
  2. Learning weights for Objective 2 where Y(x) is the set of compressive summaries, and C(y) the set of broken edges that produce subtree deletions, gives our LEARNED COMPRESSIVE system, which is our joint model of extraction and compression.
    Page 3, “Joint Model”
  3. By solving the following ILP we can compute the arg max required for prediction in the joint model:
    Page 5, “Efficient Prediction”
  4. Figure 4: Example summaries produced by our learned joint model of extraction and compression.
    Page 9, “Experiments”
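
Item 2 describes Objective 2: a compressive summary is scored by the total value of its bigrams plus a score for every parse-tree edge broken to delete a subtree. The sketch below just spells out that sum; the argument names follow the notation in the snippets (vb for bigrams, vc for cuts), and the function is illustrative rather than the paper's implementation. Both kinds of scores are parameterized and learned, which is what lets training trade coverage off against compression choices.

    def joint_score(covered_bigrams, cut_edges, bigram_values, cut_values):
        """Objective 2 as described: sum of vb over B(y) plus the subtree-deletion
        scores over the cut set C(y)."""
        return (sum(bigram_values[b] for b in covered_bigrams)
                + sum(cut_values[c] for c in cut_edges))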

subtrees

Appears in 4 sentences as: subtrees (4)
In Jointly Learning to Extract and Compress
  1. In our complete model, which jointly extracts and compresses sentences, we choose whether or not to cut individual subtrees in the constituency parses
    Page 2, “Joint Model”
  2. This ensures that only subtrees may be deleted.
    Page 3, “Joint Model”
  3. Variables z indicate edges in the parse tree that have been cut in order to remove subtrees.
    Page 5, “Efficient Prediction”
  4. Constraint 9 encodes the requirement that only full subtrees may be deleted.
    Page 6, “Efficient Prediction”
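
Items 2 and 4 state the structural requirement on deletions: a parse node may be kept only if its parent is kept, so only complete subtrees can be removed, and the cut variables fire on the edges where removal happens. Below is a hedged sketch of how such constraints could be added to a PuLP model like the extraction ILP above; the variable and helper names are illustrative, and the equality tying cuts to kept and dropped nodes is my reading of the snippets.

    def add_subtree_constraints(prob, y_node, z_cut, parent_of):
        """y_node[n] and z_cut[(p, n)] are binary PuLP variables; parent_of maps
        each non-root node n to its parent p."""
        for n, p in parent_of.items():
            # A node may be present only if its parent is present, so only full
            # subtrees can be deleted.
            prob += y_node[n] <= y_node[p]
            # The edge (p, n) counts as cut exactly when p is kept and n is dropped.
            prob += z_cut[(p, n)] == y_node[p] - y_node[n]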

parse tree

Appears in 3 sentences as: parse tree (4)
In Jointly Learning to Extract and Compress
  1. Variables yn indicate the presence of parse tree nodes.
    Page 2, “Joint Model”
  2. We represent a compressive summary as a vector y = (yn : n ∈ ts, s ∈ c) of indicators, one for each nonterminal node in each parse tree of the sentences in the document set c. A word is present in the output summary if and only if its parent parse tree node n has yn = 1 (see Figure 1b).
    Page 3, “Joint Model”
  3. Variables z indicate edges in the parse tree that have been cut in order to remove subtrees.
    Page 5, “Efficient Prediction”
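
Item 2 describes the output representation: one indicator per parse node, and a word surfaces in the summary exactly when its parent node is switched on. Below is a tiny sketch of reading the summary text back off those indicators; the data layout is an assumption made for illustration only.

    def realize_summary(selected_sentences, y):
        """selected_sentences: list of lists of (word, parent_node_id) leaves, in order;
        y: dict mapping node id to its 0/1 indicator yn.
        A word is emitted iff its parent parse-tree node n has y[n] = 1."""
        words = []
        for leaves in selected_sentences:
            words.extend(word for word, parent in leaves if y.get(parent, 0) == 1)
        return " ".join(words)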

SVM

Appears in 3 sentences as: SVM (3)
In Jointly Learning to Extract and Compress
  1. We use a soft-margin support vector machine (SVM) (Vapnik, 1998) objective over the full structured output space (Taskar et al., 2003; Tsochantaridis et al., 2004) of extractive and compressive summaries:
    Page 4, “Structured Learning”
  2. In our application, this approach efficiently solves the structured SVM training problem up to some specified tolerance ε.
    Page 4, “Structured Learning”
  3. Thus, if loss-augmented prediction turns up no new constraints on a given iteration, the current solution to the reduced problem, w and ξ, is the solution to the full SVM training problem.
    Page 4, “Structured Learning”
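
Items 2 and 3 describe constraint-generation training of the structured SVM: solve a reduced problem over the current working set of constraints, run loss-augmented prediction to find the most violated constraint for each training problem, and stop once nothing is violated by more than the tolerance ε. The loop below is a schematic of that procedure, not the authors' implementation; solve_reduced_qp, loss_augmented_predict, features, and loss are hypothetical callables.

    import numpy as np

    def constraint_generation_train(examples, epsilon, solve_reduced_qp,
                                    loss_augmented_predict, features, loss,
                                    max_rounds=100):
        """Schematic n-slack cutting-plane loop for a structured SVM (sketch)."""
        working_set = {i: [] for i in range(len(examples))}
        w, xi = solve_reduced_qp(working_set)              # initial weights and slacks
        for _ in range(max_rounds):
            added = 0
            for i, (x, y_gold) in enumerate(examples):
                # Most violated constraint: argmax over outputs of score plus loss.
                y_hat = loss_augmented_predict(w, x, y_gold)
                margin = np.dot(w, features(x, y_gold)) - np.dot(w, features(x, y_hat))
                if loss(y_hat, y_gold) - margin > xi[i] + epsilon:
                    working_set[i].append(y_hat)           # add the violated constraint
                    added += 1
            if added == 0:
                # No new constraints: (w, xi) solves the full SVM problem (item 3).
                break
            w, xi = solve_reduced_qp(working_set)          # re-solve the reduced problem
        return w, xi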

Viterbi

Appears in 3 sentences as: Viterbi (3)
In Jointly Learning to Extract and Compress
  1. In Section 4 we show that finding summaries that optimize Objective 2, Viterbi prediction, is efficient.
    Page 3, “Structured Learning”
  2. Online learning algorithms like perceptron or the margin-infused relaxed algorithm (MIRA) (Crammer and Singer, 2003) are frequently used for structured problems where Viterbi inference is available.
    Page 3, “Structured Learning”
  3. Thus, we can easily perform loss-augmented prediction using the same procedure we use to perform Viterbi prediction (described in Section 4).
    Page 4, “Structured Learning”
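
Item 3, combined with the loss-function snippets earlier, says loss-augmented prediction reuses the ordinary Viterbi (argmax) procedure because the loss factors over bigrams: fold each bigram's loss contribution into its value vb and decode as usual. Below is a sketch of that modification under the 1-minus-recall reading of the loss used above; the exact scaling is my reconstruction, not quoted from the paper.

    def loss_augment_values(values, reference_bigrams):
        """Return modified bigram values vb' so that maximizing coverage with vb'
        maximizes score(y) + loss(y, y*).  Under loss = 1 - recall, covering a
        reference bigram lowers the loss by 1/|B(y*)|, so those bigrams are
        penalized; the constant 1 does not change the argmax and is dropped."""
        penalty = 1.0 / len(reference_bigrams)
        return {b: v - (penalty if b in reference_bigrams else 0.0)
                for b, v in values.items()}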
