Large-Scale Syntactic Language Modeling with Treelets
Pauls, Adam and Klein, Dan

Article Structure

Abstract

We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.

Introduction

N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).

Treelet Language Modeling

The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
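
This back-off scheme is exactly the mechanism the treelet model borrows from n-gram modeling, with windows of tree context standing in for windows of preceding words. The sketch below illustrates the general pattern with a toy n-gram scorer; the counts, the recursive back-off, and the 0.4 penalty are illustrative choices in the style of "stupid backoff", not the standard smoothing (e.g. Kneser-Ney) actually used for models like the one in this paper.

```python
from collections import defaultdict

# Toy counts over (context, word) events; a real model would estimate
# these from a large corpus and apply Kneser-Ney or similar smoothing.
counts = defaultdict(int)
context_totals = defaultdict(int)

def observe(tokens, order=3):
    """Collect counts for all n-gram orders up to `order`."""
    padded = ["<s>"] * (order - 1) + tokens + ["</s>"]
    for i in range(order - 1, len(padded)):
        for n in range(1, order + 1):
            context = tuple(padded[i - n + 1:i])
            counts[(context, padded[i])] += 1
            context_totals[context] += 1

def prob(word, context, alpha=0.4):
    """Score with the longest observed context, backing off to shorter
    contexts (dropping the most distant word) when the n-gram is unseen."""
    if counts[(context, word)] > 0:
        return counts[(context, word)] / context_totals[context]
    if not context:
        return 1e-7  # unseen even as a unigram: tiny floor probability
    return alpha * prob(word, context[1:], alpha)

observe("the cat sat on the mat".split())
print(prob("sat", ("the", "cat")))  # observed trigram -> relative frequency
print(prob("sat", ("a", "cat")))    # unseen trigram -> backs off to ("cat",)
```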

Tree Transformations

In the previous section, we described how to condition on rich parse context to better capture the distribution of English trees.

Scoring a Sentence

Computing the probability of a sentence w under our model requires summing over all possible parses of w.
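
Written out, the quantity in question is the marginal probability of the sentence over trees. The notation below is a standard reconstruction from this description rather than a formula copied from the paper, and in practice the sum is approximated by restricting it to the parses in a (pruned) chart:

```latex
P(w) \;=\; \sum_{T :\, \mathrm{yield}(T) = w} P(T),
\qquad
P(T) \;=\; \prod_{e \in T} p\bigl(e \mid \mathrm{context}(e)\bigr)
```

where the product runs over the treelet-generation events e in the tree T, each conditioned on its (backed-off) treelet context.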

Experiments

We evaluate our model along several dimensions.

Conclusion

We have presented a simple syntactic language model that can be estimated using standard n-gram smoothing techniques on large amounts of data.

Topics

language model

Appears in 31 sentences as: language model (15) language modeling (3) language models (15)
In Large-Scale Syntactic Language Modeling with Treelets
  1. We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.
    Page 1, “Abstract”
  2. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours.
    Page 1, “Abstract”
  2. N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
    Page 1, “Introduction”
  4. At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies.
    Page 1, “Introduction”
  5. Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark,
    Page 1, “Introduction”
  6. 2004), these models have only recently been scaled to the impressive amounts of data routinely used by n-gram language models (Tan et al., 2011).
    Page 1, “Introduction”
  7. In this paper, we describe a generative, syntactic language model that conditions on local context treelets in a parse tree, backing off to smaller treelets as necessary.
    Page 1, “Introduction”
  8. The simplicity of our approach also contrasts with recent work on language modeling with tree substitution grammars (Post and Gildea, 2009), where larger treelet contexts are incorporated by using sophisticated priors to learn a segmentation of parse trees.
    Page 1, “Introduction”
  9. Instead, we build upon the success of n-gram language models , which do not assume a segmentation and instead score all overlapping contexts.
    Page 1, “Introduction”
  10. The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
    Page 2, “Treelet Language Modeling”
  11. to use back-off-based smoothing for syntactic language modeling — such techniques have been applied to models that condition on headword contexts (Charniak, 2001; Roark, 2004; Zhang, 2009).
    Page 3, “Treelet Language Modeling”
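
Item 2 above compresses the estimation story into one sentence: parse a large corpus automatically, read treelet (context, event) counts off the trees, and smooth them with ordinary n-gram machinery. A minimal sketch of the counting step is shown below, using NLTK's Tree class for convenience; the context chosen here (grandparent and parent labels only) is a deliberate simplification of the richer contexts and back-off levels in the actual model.

```python
from collections import Counter
from nltk import Tree

# One automatically parsed sentence; the paper uses roughly 1.3 billion
# tokens of automatically parsed text.
parsed = Tree.fromstring(
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)

rule_counts = Counter()
context_counts = Counter()

def collect(tree, parent_label="TOP"):
    """Count rule yields conditioned on a small treelet context.
    The context here is just (grandparent label, parent label); the actual
    model conditions on richer tree context and backs off through several
    levels, but the counting pattern is the same."""
    if not isinstance(tree, Tree):  # reached a word
        return
    yield_labels = tuple(c.label() if isinstance(c, Tree) else c for c in tree)
    context = (parent_label, tree.label())
    rule_counts[(context, yield_labels)] += 1
    context_counts[context] += 1
    for child in tree:
        collect(child, tree.label())

collect(parsed)

# Relative-frequency estimate for rules seen under one context; a real
# system smooths these counts exactly as it smooths n-gram counts.
ctx = ("S", "NP")
for (c, rule_yield), n in rule_counts.items():
    if c == ctx:
        print(rule_yield, n / context_counts[ctx])
```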


n-gram

Appears in 15 sentences as: n-gram (16)
In Large-Scale Syntactic Language Modeling with Treelets
  1. We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.
    Page 1, “Abstract”
  2. We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours.
    Page 1, “Abstract”
  3. At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies.
    Page 1, “Introduction”
  4. Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark,
    Page 1, “Introduction”
  5. Our model can be trained simply by collecting counts and using the same smoothing techniques normally applied to n-gram models (Kneser and Ney, 1995), enabling us to apply techniques developed for scaling n-gram models out of the box (Brants et al., 2007; Pauls and Klein, 2011).
    Page 1, “Introduction”
  6. Instead, we build upon the success of n-gram language models, which do not assume a segmentation and instead score all overlapping contexts.
    Page 1, “Introduction”
  7. The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
    Page 2, “Treelet Language Modeling”
  8. As in the n-gram case, we would like to pick h to be large enough to capture relevant dependencies, but small enough that we can obtain meaningful estimates from data.
    Page 2, “Treelet Language Modeling”
  9. Although it is tempting to think that we can replace the left-to-right generation of n-gram models with the purely top-down generation of typical PCFGs, in practice, words are often highly predictive of the words that follow them — indeed, n-gram models would be terrible language models if this were not the case.
    Page 3, “Treelet Language Modeling”
  10. As with n-gram models, counts for rule yields conditioned on r’ are sparse, and we must choose an appropriate back-off strategy.
    Page 3, “Treelet Language Modeling”
  11. Estimating the probabilities in our model can be done very simply using the same techniques (in fact, the same code) used to estimate n-gram language models.
    Page 3, “Treelet Language Modeling”
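
Item 11 above notes that the probabilities can be estimated with the same techniques, and even the same code, used for n-gram models. One hedged way to picture this is to serialize each treelet (context, event) pair into a flat pseudo-n-gram so that unmodified counting and smoothing code applies; the encoding below is an illustrative assumption, not the paper's exact scheme.

```python
def as_pseudo_ngram(context, event):
    """Flatten a treelet (context, event) pair into a token tuple so that
    off-the-shelf n-gram counting and smoothing code can treat it like an
    ordinary n-gram. The encoding here is an illustrative assumption."""
    return tuple(context) + (event,)

# A conditioned treelet rule becomes, in effect, a 3-"gram":
print(as_pseudo_ngram(("S", "NP"), "NP->DT_NN"))
# ('S', 'NP', 'NP->DT_NN')
```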


Treebank

Appears in 11 sentences as: Treebank (10) treebank (2)
In Large-Scale Syntactic Language Modeling with Treelets
  1. There is one additional hurdle in the estimation of our model: while there exist corpora with human-annotated constituency parses like the Penn Treebank (Marcus et al., 1993), these corpora are quite small — on the order of millions of tokens — and we cannot gather nearly as many counts as we can for n-grams, for which billions or even trillions (Brants et al., 2007) of tokens are available on the Web.
    Page 3, “Treelet Language Modeling”
  2. Figure 2: A sample parse from the Penn Treebank after the tree transformations described in Section 3.
    Page 4, “Tree Transformations”
  3. number of transformations of Treebank constituency parses that allow us to capture such dependencies.
    Page 4, “Tree Transformations”
  4. Although the Penn Treebank annotates temporal NPs, most off-the-shelf parsers do not retain these tags, and we do not assume their presence.
    Page 4, “Tree Transformations”
  5. Instead, we mark any noun that is the head of an NP-TMP constituent at least once in the Treebank as a temporal noun, so for example today would be tagged as NNT and months would be tagged as NNTS.
    Page 4, “Tree Transformations”
  6. In the Treebank, chains of verbs (e.g. will be going) have a separate VP for each verb.
    Page 4, “Tree Transformations”
  7. Because we only assume the presence of automatically derived parses, which do not produce the empty elements in the original Treebank, we must identify such elements on our own.
    Page 4, “Tree Transformations”
  8. In Table 1, we show the first four samples of length between 15 and 20 generated from our model and a 5-gram model trained on the Penn Treebank.
    Page 5, “Experiments”
  9. For training data, we constructed a large treebank by concatenating the WSJ and Brown portions of the Penn Treebank, the 50K BLLIP training sentences from Post (2011), and the AFP and APW portions of English Gigaword version 3 (Graff, 2003), totaling about 1.3 billion tokens.
    Page 5, “Experiments”
  10. We used the human-annotated parses for the sentences in the Penn Treebank, but parsed the Gigaword and BLLIP sentences with the Berkeley Parser.
    Page 5, “Experiments”
  11. We parsed the 50K positive training examples of Post (2011) with the Berkeley Parser and used the resulting treebank to train a treelet language model.
    Page 6, “Experiments”
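
Item 5 above describes the temporal-noun transformation as two passes: learn from the Treebank which nouns can head an NP-TMP constituent, then retag those nouns wherever they occur in automatically parsed trees. A minimal sketch, assuming NLTK-style trees and approximating the head as the rightmost NN/NNS child (the paper presumably relies on a proper head finder):

```python
from nltk import Tree

def temporal_nouns(treebank_trees):
    """Pass 1: collect nouns that head an NP-TMP constituent at least once.
    The head is approximated here as the rightmost NN/NNS child; a real
    implementation would use a standard head-finding table."""
    temporal = set()
    for tree in treebank_trees:
        for node in tree.subtrees(lambda t: t.label().startswith("NP-TMP")):
            nouns = [c for c in node
                     if isinstance(c, Tree) and c.label() in ("NN", "NNS")]
            if nouns:
                temporal.add(nouns[-1][0].lower())
    return temporal

def retag(tree, temporal):
    """Pass 2: rewrite NN -> NNT and NNS -> NNTS for known temporal nouns."""
    for node in tree.subtrees(lambda t: t.label() in ("NN", "NNS")):
        if node[0].lower() in temporal:
            node.set_label("NNT" if node.label() == "NN" else "NNTS")
    return tree

gold = [Tree.fromstring("(S (NP-TMP (NN today)) (VP (VBD fell)))")]
temporal = temporal_nouns(gold)                       # {'today'}
parsed = Tree.fromstring("(S (NP (NN today)) (VP (VBD rose)))")
print(retag(parsed, temporal))
# (S (NP (NNT today)) (VP (VBD rose)))
```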


machine translation

Appears in 9 sentences as: Machine Translation (2) machine translation (7)
In Large-Scale Syntactic Language Modeling with Treelets
  1. We also show fluency improvements in a preliminary machine translation experiment.
    Page 1, “Abstract”
  2. N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
    Page 1, “Introduction”
  3. We also show fluency improvements in a preliminary machine translation reranking experiment.
    Page 2, “Introduction”
  4. For machine translation, a model that builds target-side constituency parses, such as that of Galley et al.
    Page 4, “Scoring a Sentence”
  5. We report machine translation reranking results in Section 5.4.
    Page 5, “Experiments”
  6. (2004) and Cherry and Quirk (2008) both use the 1-best output of a machine translation system.
    Page 6, “Experiments”
  7. 5.3.3 Machine Translation Classification
    Page 8, “Experiments”
  8. (2004) and Cherry and Quirk (2008) in evaluating our language models on their ability to distinguish the 1-best output of a machine translation system from a reference translation in a pairwise fashion.
    Page 8, “Experiments”
  9. 5.4 Machine Translation Fluency
    Page 8, “Experiments”


BLEU

Appears in 6 sentences as: BLEU (7)
In Large-Scale Syntactic Language Modeling with Treelets
  1. The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
    Page 8, “Experiments”
  2. In particular, their translations had a lower BLEU score, making their task easier.
    Page 8, “Experiments”
  3. We see that our system prefers the reference much more often than the S-GRAM language model. However, we also note that the easiness of the task is correlated with the quality of translations (as measured in BLEU score).
    Page 8, “Experiments”
  4. We did not find that the use of our syntactic language model made any statistically significant increases in BLEU score.
    Page 8, “Experiments”
  5. However, we noticed in general that the translations favored by our model were more fluent, a useful improvement to which BLEU is often insensitive.
    Page 8, “Experiments”
  6. Although these two hypothesis sets had the same BLEU score (up to statistical significance), the Turkers preferred the output obtained using our syntactic language model 59% of the time, indicating that our model had managed to pick out more fluent hypotheses that nonetheless were of the same BLEU score.
    Page 8, “Experiments”


Penn Treebank

Appears in 6 sentences as: Penn Treebank (6)
In Large-Scale Syntactic Language Modeling with Treelets
  1. There is one additional hurdle in the estimation of our model: while there exist corpora with human-annotated constituency parses like the Penn Treebank (Marcus et al., 1993), these corpora are quite small — on the order of millions of tokens — and we cannot gather nearly as many counts as we can for n-grams, for which billions or even trillions (Brants et al., 2007) of tokens are available on the Web.
    Page 3, “Treelet Language Modeling”
  2. Figure 2: A sample parse from the Penn Treebank after the tree transformations described in Section 3.
    Page 4, “Tree Transformations”
  3. Although the Penn Treebank annotates temporal NPs, most off-the-shelf parsers do not retain these tags, and we do not assume their presence.
    Page 4, “Tree Transformations”
  4. In Table 1, we show the first four samples of length between 15 and 20 generated from our model and a 5-gram model trained on the Penn Treebank.
    Page 5, “Experiments”
  5. For training data, we constructed a large treebank by concatenating the WSJ and Brown portions of the Penn Treebank, the 50K BLLIP training sentences from Post (2011), and the AFP and APW portions of English Gigaword version 3 (Graff, 2003), totaling about 1.3 billion tokens.
    Page 5, “Experiments”
  6. We used the human-annotated parses for the sentences in the Penn Treebank, but parsed the Gigaword and BLLIP sentences with the Berkeley Parser.
    Page 5, “Experiments”


BLEU score

Appears in 5 sentences as: BLEU score (5) BLEU scores (1)
In Large-Scale Syntactic Language Modeling with Treelets
  1. The BLEU scores for these outputs are 32.7, 27.8, and 20.8.
    Page 8, “Experiments”
  2. In particular, their translations had a lower BLEU score, making their task easier.
    Page 8, “Experiments”
  3. We see that our system prefers the reference much more often than the S-GRAM language model. However, we also note that the easiness of the task is correlated with the quality of translations (as measured in BLEU score).
    Page 8, “Experiments”
  4. We did not find that the use of our syntactic language model made any statistically significant increases in BLEU score.
    Page 8, “Experiments”
  5. Although these two hypothesis sets had the same BLEU score (up to statistical significance), the Turkers preferred the output obtained using our syntactic language model 59% of the time, indicating that our model had managed to pick out more fluent hypotheses that nonetheless were of the same BLEU score.
    Page 8, “Experiments”


generative models

Appears in 5 sentences as: generative model (1) generative models (4)
In Large-Scale Syntactic Language Modeling with Treelets
  1. Table 2: Perplexity of several generative models on Section 0 of the WSJ.
    Page 6, “Experiments”
  2. Our model outperforms all other generative models, though the improvement over the n-gram model is not statistically significant.
    Page 6, “Experiments”
  3. We would like to use our model to make grammaticality judgements, but as a generative model it can only provide us with probabilities.
    Page 6, “Experiments”
  4. In Table 4, we also show the performance of the generative models trained on our 1B corpus.
    Page 7, “Experiments”
  5. All generative models improve, but TREELET-RULE remains the best, now outperforming the RERANK system, though of course it is likely that RERANK would improve if it could be scaled up to more training data.
    Page 7, “Experiments”


reranking

Appears in 5 sentences as: RERANK (3) reranking (4)
In Large-Scale Syntactic Language Modeling with Treelets
  1. We also show fluency improvements in a preliminary machine translation reranking experiment.
    Page 2, “Introduction”
  2. We report machine translation reranking results in Section 5.4.
    Page 5, “Experiments”
  3. The latter report results for two binary classifiers: RERANK uses the reranking features of Charniak and Johnson (2005), and TSG uses
    Page 6, “Experiments”
  4. All generative models improve, but TREELET-RULE remains the best, now outperforming the RERANK system, though of course it is likely that RERANK would improve if it could be scaled up to more training data.
    Page 7, “Experiments”
  5. We also carried out reranking experiments on 1000-best lists from Moses using our syntactic language model as a feature.
    Page 8, “Experiments”


statistically significant

Appears in 5 sentences as: statistical significance (1) statistically significant (4)
In Large-Scale Syntactic Language Modeling with Treelets
  1. The differences between scores marked with l are not statistically significant.
    Page 6, “Experiments”
  2. Our model outperforms all other generative models, though the improvement over the n-gram model is not statistically significant.
    Page 6, “Experiments”
  3. We did not find that the use of our syntactic language model made any statistically significant increases in BLEU score.
    Page 8, “Experiments”
  4. Although these two hypothesis sets had the same BLEU score (up to statistical significance), the Turkers preferred the output obtained using our syntactic language model 59% of the time, indicating that our model had managed to pick out more fluent hypotheses that nonetheless were of the same BLEU score.
    Page 8, “Experiments”
  5. This result was statistically significant with p < 0.001 using bootstrap resampling.
    Page 8, “Experiments”
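
Item 5 above reports significance via bootstrap resampling over pairwise preference judgements. The sketch below shows one common variant of such a test; the 59%-of-1000-judgements setup is illustrative, not the paper's actual data.

```python
import random

def bootstrap_pvalue(preferences, trials=10_000, seed=0):
    """Bootstrap test against the null hypothesis of 'no preference' (rate 0.5).
    `preferences` holds 1 (our system preferred) or 0 (baseline preferred).
    Returns the fraction of resamples whose preference rate is <= 0.5."""
    rng = random.Random(seed)
    n = len(preferences)
    at_or_below_half = 0
    for _ in range(trials):
        sample = [preferences[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0.5:
            at_or_below_half += 1
    return at_or_below_half / trials

# Illustrative data: 59% of 1000 pairwise judgements prefer our output.
judgements = [1] * 590 + [0] * 410
print(bootstrap_pvalue(judgements))  # small value => significant preference
```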


Berkeley Parser

Appears in 4 sentences as: Berkeley Parser (4)
In Large-Scale Syntactic Language Modeling with Treelets
  1. We used the human-annotated parses for the sentences in the Penn Treebank, but parsed the Gigaword and BLLIP sentences with the Berkeley Parser.
    Page 5, “Experiments”
  2. PCFG-LA The Berkeley Parser in language model mode.
    Page 5, “Experiments”
  3. We use signatures generated by the Berkeley Parser.
    Page 6, “Experiments”
  4. We parsed the 50K positive training examples of Post (2011) with the Berkeley Parser and used the resulting treebank to train a treelet language model.
    Page 6, “Experiments”


binary classification

Appears in 4 sentences as: binary classification (2) binary classifiers (2)
In Large-Scale Syntactic Language Modeling with Treelets
  1. The former train a latent PCFG support vector machine for binary classification (LSVM).
    Page 6, “Experiments”
  2. The latter report results for two binary classifiers : RERANK uses the reranking features of Charniak and Johnson (2005), and TSG uses
    Page 6, “Experiments”
  3. Indeed, the methods in Post (2011) are simple binary classifiers , and it is not clear that these models would be properly calibrated for any other task, such as integration in a decoder.
    Page 7, “Experiments”
  4. “Pairwise” accuracy is the fraction of correct sentences whose SLR score was higher than its noisy version, and “independent” refers to standard binary classification accuracy.
    Page 7, “Experiments”
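
Item 4 above defines the two accuracy measures used on the noisy-WSJ task, and they can be transcribed directly; the SLR scores and the decision threshold below are hypothetical inputs for illustration.

```python
def pairwise_accuracy(good_scores, noisy_scores):
    """Fraction of pairs where the original sentence outscores its noisy version."""
    return sum(g > n for g, n in zip(good_scores, noisy_scores)) / len(good_scores)

def independent_accuracy(good_scores, noisy_scores, threshold):
    """Standard binary classification: score above threshold => 'grammatical'."""
    correct = sum(g > threshold for g in good_scores)
    correct += sum(n <= threshold for n in noisy_scores)
    return correct / (len(good_scores) + len(noisy_scores))

# Hypothetical SLR scores for three sentence pairs and an illustrative threshold.
good, noisy = [-1.2, -0.8, -1.5], [-2.0, -0.9, -1.4]
print(pairwise_accuracy(good, noisy))            # 2/3
print(independent_accuracy(good, noisy, -1.3))   # 4/6 with this threshold
```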


constituency parses

Appears in 4 sentences as: constituency parsers (1) constituency parses (3)
In Large-Scale Syntactic Language Modeling with Treelets
  1. There is one additional hurdle in the estimation of our model: while there exist corpora with human-annotated constituency parses like the Penn Treebank (Marcus et al., 1993), these corpora are quite small — on the order of millions of tokens — and we cannot gather nearly as many counts as we can for n-grams, for which billions or even trillions (Brants et al., 2007) of tokens are available on the Web.
    Page 3, “Treelet Language Modeling”
  2. However, we can use one of several high-quality constituency parsers (Collins, 1997; Charniak, 2000; Petrov et al., 2006) to automatically generate parses.
    Page 3, “Treelet Language Modeling”
  3. number of transformations of Treebank constituency parses that allow us to capture such dependencies.
    Page 4, “Tree Transformations”
  4. For machine translation, a model that builds target-side constituency parses, such as that of Galley et al.
    Page 4, “Scoring a Sentence”


translation system

Appears in 4 sentences as: translation system (3) translation systems (1)
In Large-Scale Syntactic Language Modeling with Treelets
  1. N-gram language models are a central component of all speech recognition and machine translation systems, and a great deal of research centers around refining models (Chen and Goodman, 1998), efficient storage (Pauls and Klein, 2011; Heafield, 2011), and integration into decoders (Koehn, 2004; Chiang, 2005).
    Page 1, “Introduction”
  2. (2004) and Cherry and Quirk (2008) both use the 1-best output of a machine translation system.
    Page 6, “Experiments”
  3. Cherry and Quirk (2008) report an accuracy of 71.9% on a similar experiment with German as a source language, though the translation system and training data were different, so the numbers are not comparable.
    Page 8, “Experiments”
  4. (2004) and Cherry and Quirk (2008) in evaluating our language models on their ability to distinguish the 1-best output of a machine translation system from a reference translation in a pairwise fashion.
    Page 8, “Experiments”


models trained

Appears in 3 sentences as: model trained (1) models trained (2)
In Large-Scale Syntactic Language Modeling with Treelets
  1. In Table 1, we show the first four samples of length between 15 and 20 generated from our model and a 5-gram model trained on the Penn Treebank.
    Page 5, “Experiments”
  2. Table 5: Classification accuracies on the noisy WSJ for models trained on WSJ Sections 2-21 and our 1B token corpus.
    Page 7, “Experiments”
  3. In Table 4, we also show the performance of the generative models trained on our 1B corpus.
    Page 7, “Experiments”


unigram

Appears in 3 sentences as: unigram (3)
In Large-Scale Syntactic Language Modeling with Treelets
  1. p(w|P, R, r’, w_1, w_2) to p(w|P, R, r’, w_1) and then p(w|P, R, r’). From there, we back off to p(w|P, R) where R is the sibling immediately to the right of P, then to a raw PCFG p(w|P), and finally to a unigram distribution.
    Page 3, “Treelet Language Modeling”
  2. We used a simple measure for isolating the syntactic likelihood of a sentence: we take the log-probability under our model and subtract the log-probability under a unigram model, then normalize by the length of the sentence. This measure, which we call the syntactic log-odds ratio (SLR), is a crude way of “subtracting out” the semantic component of the generative probability, so that sentences that use rare words are not penalized for doing so.
    Page 6, “Experiments”
  3. (2004) also report using a parser probability normalized by the unigram probability (but not length), and did not find it effective.
    Page 6, “Experiments”
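
Item 2 above defines the syntactic log-odds ratio (SLR) in words; the transcription below is direct, with the two log-probability functions left as stand-ins and the toy models at the bottom purely illustrative.

```python
import math

def slr(sentence_tokens, logprob_syntactic, logprob_unigram):
    """Syntactic log-odds ratio: length-normalized difference between the
    syntactic model's log-probability and a unigram model's log-probability.
    `logprob_syntactic` and `logprob_unigram` are stand-ins for the two models."""
    return (logprob_syntactic(sentence_tokens)
            - logprob_unigram(sentence_tokens)) / len(sentence_tokens)

# Toy stand-in models, for illustration only.
unigram_lp = lambda toks: sum(math.log(1e-4) for _ in toks)
syntactic_lp = lambda toks: sum(math.log(3e-4) for _ in toks)
print(slr("the cat sat on the mat".split(), syntactic_lp, unigram_lp))
# ~1.10, i.e. log(3) per token in favor of the syntactic model
```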
