Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
Yamangil, Elif and Shieber, Stuart M.

Article Structure

Abstract

We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing.

Introduction

Given an aligned corpus of tree pairs, we might want to learn a mapping between the paired trees.

Sentence compression

Sentence compression is the task of summarizing a sentence while retaining most of the informational content and remaining grammatical (Jing, 2000).
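
In the extractive setting studied here, a compression is an order-preserving subset of the source words, i.e., a keep-or-delete decision per token (see item 9 under "sentence compression" below). The following is a minimal illustration of that representation only; the helper name is ours, not the paper's, and the example pair is the Broadcast News pair quoted under "sentence pair" below.

```python
# Minimal sketch (not from the paper): extractive compression as an
# order-preserving deletion of source tokens. The helper name is hypothetical.

def compress(source_tokens, keep_mask):
    """Return the order-preserving subset of tokens flagged for keeping."""
    return [tok for tok, keep in zip(source_tokens, keep_mask) if keep]

source = ["NPR's", "Anne", "Garrels", "reports"]
target = compress(source, [False, True, True, True])
assert target == ["Anne", "Garrels", "reports"]
```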

The STSG Model

Synchronous tree-substitution grammar is a formalism for synchronously generating a pair of non-isomorphic source and target trees (Eisner, 2003).
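
The paper treats STSG rules as paired elementary trees whose frontier nonterminals are linked, with an epsilon alignment allowing material to be deleted on the target side (see the Abstract and "lexicalization" below). The sketch below shows one plausible way to encode such a rule; the field names and the example rule are ours, not taken from the paper.

```python
# Illustrative sketch only: an STSG rule as a pair of elementary-tree fragments
# whose frontier nonterminals are linked. A source frontier node linked to None
# stands for an epsilon alignment, i.e., its material is deleted on the target
# side. Field names are assumptions, not the paper's notation.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class STSGRule:
    root: str                # shared root category, e.g. "S"
    source_frag: str         # bracketed source elementary tree
    target_frag: str         # bracketed target elementary tree
    # links: (source frontier index, target frontier index or None for epsilon)
    links: Tuple[Tuple[int, Optional[int]], ...]

# e.g. a hypothetical rule that drops a sentence-initial PP while copying NP and VP:
rule = STSGRule(
    root="S",
    source_frag="(S (PP ) (NP ) (VP ))",
    target_frag="(S (NP ) (VP ))",
    links=((0, None), (1, 0), (2, 1)),   # PP -> epsilon, NP -> NP, VP -> VP
)
```

Deeper fragments than a single level, including lexical anchors, are exactly what the paper argues a principled induction method should be able to learn (see "lexicalization" below).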

Evaluation

We compared the Gibbs sampling compressor (GS) against a version of maximum a posteriori EM (with Dirichlet parameter greater than 1) and a discriminative STSG based on SVM training (Cohn and Lapata, 2008) (SVM).

Conclusion

We explored nonparametric Bayesian learning of non-isomorphic tree mappings using Dirichlet process priors.

Topics

sentence compression

Appears in 15 sentences as: Sentence compression (1) sentence compression (14)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing.
    Page 1, “Abstract”
  2. We formalize nonparametric Bayesian STSG with epsilon alignment in full generality, and provide a Gibbs sampling algorithm for posterior inference tailored to the task of extractive sentence compression.
    Page 1, “Abstract”
  3. Such induction of tree mappings has application in a variety of natural-language-processing tasks including machine translation, paraphrase, and sentence compression.
    Page 1, “Introduction”
  4. In this work, we explore techniques for inducing synchronous tree-substitution grammars (STSG) using as a testbed application extractive sentence compression.
    Page 1, “Introduction”
  5. In this work, we use an extension of the aforementioned models of generative segmentation for STSG induction, and describe an algorithm for posterior inference under this model that is tailored to the task of extractive sentence compression.
    Page 2, “Introduction”
  6. In the following, we define the task of extractive sentence compression and the Bayesian STSG model, and algorithms we used for inference and prediction.
    Page 2, “Introduction”
  7. We then describe the experiments in extractive sentence compression and present our results in contrast with alternative algorithms.
    Page 2, “Introduction”
  8. Sentence compression is the task of summarizing a sentence while retaining most of the informational content and remaining grammatical (Jing, 2000).
    Page 2, “Sentence compression”
  9. In extractive sentence compression, which we focus on in this paper, an order-preserving subset of the words in the sentence are selected to form the summary, that is, we summarize by deleting words (Knight and Marcu, 2002).
    Page 2, “Sentence compression”
  10. In supervised sentence compression, the goal is to generalize from a parallel training corpus of sentences (source) and their compressions (target) to unseen sentences in a test set to predict their compressions.
    Page 3, “Sentence compression”
  11. Our sampling updates are extensions of those used by Cohn and Blunsom (2009) in MT, but are tailored to our task of extractive sentence compression.
    Page 5, “The STSG Model”

SVM

Appears in 10 sentences as: SVM (11)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. We achieve substantial improvements against a number of baselines including EM, support vector machine (SVM) based discriminative training, and variational Bayes (VB).
    Page 2, “Introduction”
  2. We compared the Gibbs sampling compressor (GS) against a version of maximum a posteriori EM (with Dirichlet parameter greater than 1) and a discriminative STSG based on SVM training (Cohn and Lapata, 2008) (SVM).
    Page 6, “Evaluation”
  3. EM is a natural benchmark, while SVM is also appropriate since it can be taken as the state of the art for our task.
    Page 6, “Evaluation”
  4. Nonetheless, because the comparison system is a generalization of the extractive SVM compressor of Cohn and Lapata (2007), we do not expect that the results would differ qualitatively.
    Page 6, “Evaluation”
  5. SVM EM GS (table column headers)
    Page 7, “Evaluation”
  6. SVM EM GS Gold (table column headers)
    Page 7, “Evaluation”
  7. In our experiments with the publicly available SVM system we used all except paraphrasal rules extracted from bilingual corpora (Cohn and Lapata, 2008).
    Page 7, “Evaluation”
  8. EM used a subset of the rules extracted by SVM, namely all rules except non-head deleting compression rules, and was initialized uniformly.
    Page 7, “Evaluation”
  9. We asked 15 self-reported native English speakers for their judgments of GS, EM, and SVM output sentences and the gold standard in terms of grammaticality (how fluent the compression is) and importance (how much of the meaning of and important information from the original sentence is retained) on a scale of 1 (worst) to 5 (best).
    Page 7, “Evaluation”
  10. EM and SVM perform at very similar levels, which we attribute to using the same set of rules, while GS performs at a level substantially better than both, and much closer to human performance in both criteria.
    Page 7, “Evaluation”

overfitting

Appears in 7 sentences as: overfitting (7)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. In summary, previous methods suffer from problems of narrowness of search, having to restrict the space of possible rules, and overfitting in preferring overly specific grammars.
    Page 2, “Introduction”
  2. We pursue the use of hierarchical probabilistic models incorporating sparse priors to simultaneously solve both the narrowness and overfitting problems.
    Page 2, “Introduction”
  3. Segmentation is achieved by introducing a prior bias towards grammars that are compact representations of the data, namely by enforcing simplicity and sparsity: preferring simple rules (smaller segments) unless the use of a complex rule is evidenced by the data (through repetition), and thus mitigating the overfitting problem.
    Page 2, “Introduction”
  4. (Eisner, 2003) However, as noted earlier, EM is subject to the narrowness and overfitting problems.
    Page 4, “The STSG Model”
  5. EM gives a strong baseline since it already uses rules that are limited in depth and number of frontier nodes by stipulation, helping with the overfitting we have mentioned, surprisingly outperforming its discriminative counterpart in both precision and recall (and consequently RelF1).
    Page 7, “Evaluation”
  6. We conclude that the mitigation of the two factors (narrowness and overfitting) both contribute to the performance gain of GS.
    Page 8, “Evaluation”
  7. Our investigation with variational Bayes showed that the improvement is due both to finding sparse grammars (mitigating overfitting) and to searching over the space of all grammars (mitigating narrowness).
    Page 9, “Conclusion”
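
Items 2 and 3 above describe the sparsity bias: a complex rule gains probability only through repeated use, while otherwise mass falls back on a base distribution favoring simpler rules. Under the Dirichlet process priors named in the Conclusion, this bias shows up in the standard Chinese-restaurant-style predictive probability sketched below; the function and symbols are generic for this family of models, not necessarily the paper's exact parameterization (which also involves the β hyperparameters).

```python
# Generic sketch of the Chinese-restaurant-process predictive probability that
# a Dirichlet process prior over rule distributions gives rise to. Not the
# paper's exact model; names and interface are our assumptions.

def dp_rule_prob(rule, counts_for_root, alpha, base_prob):
    """P(rule | previously used rules with the same root nonterminal).

    counts_for_root: dict mapping rules to how often they were used so far
    alpha:           DP concentration for this root category
    base_prob:       probability of `rule` under the base distribution P0
    """
    n_rule = counts_for_root.get(rule, 0)
    n_total = sum(counts_for_root.values())
    return (n_rule + alpha * base_prob) / (n_total + alpha)
```

Frequently reused rules gain probability mass through their counts, while the alpha-weighted base term keeps probability on simpler, unseen rules: the compactness-plus-sparsity preference described above.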

word alignments

Appears in 6 sentences as: word alignment (1) word alignments (5)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. These translation tasks are characterized by the relative ability to commit to parallel parse trees and availability of word alignments, yet the unavailability of large-scale data, calling for a Bayesian tree-to-tree formalism.
    Page 1, “Abstract”
  2. One approach is to use word alignments (where these can be reliably estimated, as in our testbed application) to align subtrees and extract rules (Och and Ney, 2004; Galley et al., 2004) but this leaves open the question of finding the right level of generality of the rules — how deep the rules should be and how much lexicalization they should involve — necessitating resorting to heuristics such as minimality of rules, and leading to …
    Page 1, “Introduction”
  3. … possibility of searching over the infinite space of grammars (and, in machine translation, possible word alignments), thus sidestepping the narrowness problem outlined above as well.
    Page 2, “Introduction”
  4. This task is characterized by the availability of word alignments, providing a clean testbed for investigating the effects of grammar extraction.
    Page 2, “Introduction”
  5. In particular, we visit every tree pair and each of its source nodes i, and update its alignment by selecting between and within two choices: (a) unaligned, (b) aligned with some target node j or ε. The number of possibilities j in (b) is significantly limited, firstly by the word alignment (for instance, a source node dominating a deleted subspan cannot be aligned with a target node), and secondly by the current alignment of other nearby aligned source nodes.
    Page 5, “The STSG Model”
  6. The future for this work would involve natural extensions such as mixing over the space of word alignments; this would allow application to MT-like tasks where flexible word reordering is allowed, such as abstractive sentence compression and paraphrasing.
    Page 9, “Conclusion”
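
Item 5 above is the heart of the sampler: each source node's alignment is resampled in turn, conditioned on every other current decision. A highly simplified sketch of such a Gibbs sweep follows; the callables it expects (candidate_choices, posterior_weight, apply_choice) and the tree-pair interface are hypothetical placeholders, not the paper's implementation.

```python
# Simplified sketch of one Gibbs sweep over source nodes: for every tree pair
# and each of its source nodes, resample that node's alignment among
# (a) unaligned and (b) aligned with some target node j or epsilon.
# The injected callables are assumptions, not the paper's code.

import random

def gibbs_sweep(tree_pairs, state, candidate_choices, posterior_weight,
                apply_choice, rng=random):
    for pair in tree_pairs:
        for node in pair.source_nodes():
            # Candidates are limited by the word alignment (e.g. a source node
            # dominating a deleted subspan cannot align to a target node) and
            # by the current alignments of nearby aligned source nodes.
            choices = candidate_choices(pair, node, state)
            # Unnormalized conditional posterior weight of each choice given
            # all other current alignment decisions.
            weights = [posterior_weight(pair, node, c, state) for c in choices]
            pick = rng.choices(choices, weights=weights, k=1)[0]
            apply_choice(state, pair, node, pick)
```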

Gibbs sampling

Appears in 4 sentences as: Gibbs sampling (4)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. We formalize nonparametric Bayesian STSG with epsilon alignment in full generality, and provide a Gibbs sampling algorithm for posterior inference tailored to the task of extractive sentence compression.
    Page 1, “Abstract”
  2. 3.2 Posterior inference via Gibbs sampling
    Page 5, “The STSG Model”
  3. We use Gibbs sampling (Geman and Geman, 1984), a Markov chain Monte Carlo (MCMC) method, to sample from the posterior (3).
    Page 5, “The STSG Model”
  4. We compared the Gibbs sampling compressor (GS) against a version of maximum a posteriori EM (with Dirichlet parameter greater than 1) and a discriminative STSG based on SVM training (Cohn and Lapata, 2008) (SVM).
    Page 6, “Evaluation”

gold standard

Appears in 4 sentences as: gold standard (4)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. The compression rate for the gold standard was 65.67%.
    Page 7, “Evaluation”
  2. As an automated metric of quality, we compute F-score based on grammatical relations (relational F1, or RelF1) (Riezler et al., 2003), by which the consistency between the set of predicted grammatical relations and those from the gold standard is measured, which has been shown by Clarke and Lapata (2006b) to correlate reliably with human judgments.
    Page 7, “Evaluation”
  3. For all three systems we obtained predictions for the test set and used the Stanford parser to extract grammatical relations from predicted trees and the gold standard.
    Page 7, “Evaluation”
  4. We asked 15 self-reported native English speakers for their judgments of GS, EM, and SVM output sentences and the gold standard in terms of grammaticality (how fluent the compression is) and importance (how much of the meaning of and important information from the original sentence is retained) on a scale of 1 (worst) to 5 (best).
    Page 7, “Evaluation”
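
Item 2 above defines the automated metric: F-score over the grammatical relations extracted from the predicted and gold compressions. Below is a small sketch of that computation, assuming relations are represented as hashable tuples such as (relation, head, dependent); the relation extraction itself (the Stanford parser in the paper) is not shown, and the set-based treatment is our simplification.

```python
# Sketch of relational F1 (Riezler et al., 2003) as described above: compare
# the relations of the predicted compression against those of the gold
# standard. Relation tuples in the example are hypothetical.

def relational_f1(predicted_relations, gold_relations):
    pred, gold = set(predicted_relations), set(gold_relations)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

assert abs(relational_f1({("nsubj", "reports", "Garrels")},
                         {("nsubj", "reports", "Garrels"),
                          ("nn", "Garrels", "Anne")}) - 2 / 3) < 1e-9
```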

hyperparameters

Appears in 4 sentences as: Hyperparameters (1) hyperparameters (3)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. The hyperparameters α_c, β_c can be incorporated into the generative model as random variables; however, we opt to fix these at various constants to investigate different levels of sparsity.
    Page 4, “The STSG Model”
  2. Assuming fixed hyperparameters α = {α_c} and β = {β_c}, our inference problem is to find the posterior distribution of the derivation sequences
    Page 5, “The STSG Model”
  3. All hyperparameters α_c, β_c were held constant at α, β for simplicity and were fit using grid-search over α ∈ [10^-6, 10^6], β ∈ [10^-3, 0.5].
    Page 7, “Evaluation”
  4. Hyperparameters were handled the same way as for GS.
    Page 8, “Evaluation”
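
Item 3 above ties all α_c, β_c to a single α, β and fits them by grid search over α ∈ [10^-6, 10^6] and β ∈ [10^-3, 0.5]. A sketch of such a search is below; only the search ranges come from the paper, while the specific grid points and the scoring callback are our assumptions.

```python
# Hypothetical grid search over the tied hyperparameters (alpha, beta).
# Grid points and the evaluate() callback (e.g. RelF1 on a development set)
# are assumptions; the paper states only the search ranges.

import itertools

ALPHA_GRID = [10.0 ** k for k in range(-6, 7)]    # 1e-6 ... 1e6, log-spaced
BETA_GRID = [1e-3, 1e-2, 0.1, 0.25, 0.5]          # within [1e-3, 0.5]

def grid_search(evaluate):
    """Return the (alpha, beta) pair maximizing evaluate(alpha, beta)."""
    return max(itertools.product(ALPHA_GRID, BETA_GRID),
               key=lambda ab: evaluate(*ab))
```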

sentence pair

Appears in 4 sentences as: sentence pair (2) sentence pairs (2)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. An example sentence pair, which we use as a running example, is the following:
    Page 2, “Sentence compression”
  2. See Figure 1 for an example of how an STSG with these rules would operate in synchronously generating our example sentence pair.
    Page 3, “The STSG Model”
  3. This corpus consists of 1370 sentence pairs that were manually created from transcribed Broadcast News stories.
    Page 6, “Evaluation”
  4. … a pattern of compression used many times in the BNC in sentence pairs such as “NPR’s Anne Garrels reports” / “Anne Garrels reports”.
    Page 9, “Evaluation”

lexicalization

Appears in 3 sentences as: lexicalization (2) lexicalized (1)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. One approach is to use word alignments (where these can be reliably estimated, as in our testbed application) to align subtrees and extract rules (Och and Ney, 2004; Galley et al., 2004) but this leaves open the question of finding the right level of generality of the rules — how deep the rules should be and how much lexicalization they should involve — necessitating resorting to heuristics such as minimality of rules, and leading to …
    Page 1, “Introduction”
  2. Second, the ability to have rules deeper than one level provides a principled way of modeling lexicalization, whose importance has been emphasized (Galley and McKeown, 2007; Yamangil and Nelken, 2008).
    Page 3, “The STSG Model”
  3. Of especial interest are deep lexicalized rules such as …
    Page 9, “Evaluation”

machine translation

Appears in 3 sentences as: machine translation (3)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. Such induction of tree mappings has application in a variety of natural-language-processing tasks including machine translation, paraphrase, and sentence compression.
    Page 1, “Introduction”
  2. Such models have been used as generative solutions to several other segmentation problems, ranging from word segmentation (Goldwater et al., 2006), to parsing (Cohn et al., 2009; Post and Gildea, 2009) and machine translation (DeNero et al., 2008; Cohn and Blunsom, 2009; Liu and Gildea, 2009).
    Page 2, “Introduction”
  3. … possibility of searching over the infinite space of grammars (and, in machine translation, possible word alignments), thus sidestepping the narrowness problem outlined above as well.
    Page 2, “Introduction”

translation tasks

Appears in 3 sentences as: translation tasks (3)
In Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
  1. We describe our experiments with training algorithms for tree-to-tree synchronous tree-substitution grammar (STSG) for monolingual translation tasks such as sentence compression and paraphrasing.
    Page 1, “Abstract”
  2. These translation tasks are characterized by the relative ability to commit to parallel parse trees and availability of word alignments, yet the unavailability of large-scale data, calling for a Bayesian tree-to-tree formalism.
    Page 1, “Abstract”
  3. Overall, we take these results as being encouraging for STSG induction via Bayesian nonparametrics for monolingual translation tasks.
    Page 9, “Conclusion”
