How much do word embeddings encode about syntax?
Andreas, Jacob and Klein, Dan

Article Structure

Abstract

Do continuous word embeddings encode any useful information for constituency parsing?

Introduction

This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.

Three possible benefits of word embeddings

We are interested in the question of whether a state-of-the-art discrete-variable constituency parser can be improved with word embeddings, and, more precisely, what aspect (or aspects) of the parser can be altered to make effective use of embeddings.

Parser extensions

For the experiments in this paper, we will use the Berkeley parser (Petrov and Klein, 2007) and the related Maryland parser (Huang and Harper, 2011).

Experimental setup

We use the Maryland implementation of the Berkeley parser as our baseline for the kernel-smoothed lexicon, and the Maryland featured parser as our baseline for the embedding-featured lexicon. For all experiments, we use 50-dimensional word embeddings.

Results

Various model-specific experiments are shown in Table 1.

Conclusion

With the goal of exploring how much useful syntactic information is provided by unsupervised word embeddings, we have presented three variations on a state-of-the-art parsing model, with extensions to the out-of-vocabulary model, lexicon, and feature set.

Topics

embeddings

Appears in 38 sentences as: Embeddings (2) embeddings (37)
In How much do word embeddings encode about syntax?
  1. Do continuous word embeddings encode any useful information for constituency parsing?
    Page 1, “Abstract”
  2. We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon.
    Page 1, “Abstract”
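    (A code sketch of the first of these mechanisms, the out-of-vocabulary mapping, appears after this list.)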
  3. Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data.
    Page 1, “Abstract”
  4. Our results support an overall hypothesis that word embeddings import syntactic information that is ultimately redundant with distinctions learned from treebanks in other ways.
    Page 1, “Abstract”
  5. This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.
    Page 1, “Introduction”
  6. While word embeddings can be constructed directly from surface distributional statistics, as in LSA, more sophisticated tools for unsupervised extraction of word representations have recently gained popularity (Collobert et al., 2011; Mikolov et al., 2013a).
    Page 1, “Introduction”
  7. (Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features.
    Page 1, “Introduction”
  8. In the other direction, access to a syntactic parse has been shown to be useful for constructing word embeddings for phrases compositionally (Hermann and Blunsom, 2013; Andreas and Ghahramani, 2013).
    Page 1, “Introduction”
  9. Dependency parsers have seen gains from distributional statistics in the form of discrete word clusters (Koo et al., 2008), and recent work (Bansal et al., 2014) suggests that similar gains can be derived from embeddings like the ones used in this paper.
    Page 1, “Introduction”
  10. It has been less clear how (and indeed whether) word embeddings in and of themselves are useful for constituency parsing.
    Page 1, “Introduction”
  11. There certainly exist competitive parsers that internally represent lexical items as real-valued vectors, such as the neural network-based parser of Henderson (2004), and even parsers which use pre-trained word embeddings to represent the lexicon, such as Socher et al. (2013).
    Page 1, “Introduction”
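
The first mechanism listed in sentence 2 above, connecting out-of-vocabulary words to known ones, amounts to a nearest-neighbor lookup in embedding space. The sketch below is a minimal illustration of that idea rather than the paper's implementation: the function name, the dict-of-arrays embedding format, and the choice of cosine similarity are all assumptions.

    import numpy as np

    def nearest_in_vocab(word, embeddings, train_vocab):
        # Map an unseen word to its closest training-set word in embedding space,
        # so the parser can borrow that word's lexical statistics.
        #   embeddings: dict mapping word -> 1-D numpy array (e.g. 50-dimensional,
        #               matching the dimensionality used in the experiments)
        #   train_vocab: set of words observed in the supervised training data
        if word not in embeddings:
            return None  # no embedding either: fall back to the usual unknown-word handling
        target = embeddings[word]
        target = target / np.linalg.norm(target)
        best, best_sim = None, -2.0
        for cand in train_vocab:
            vec = embeddings.get(cand)
            if vec is None:
                continue
            sim = float(np.dot(target, vec / np.linalg.norm(vec)))
            if sim > best_sim:
                best, best_sim = cand, sim
        return best

For example, a proper name that never occurs in the treebank could inherit the tag distribution of a distributionally similar name that was observed, which is the intuition behind the first mechanism in sentence 2.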

word embeddings

Appears in 25 sentences as: word embedding (2) Word embeddings (2) word embeddings (21)
In How much do word embeddings encode about syntax?
  1. Do continuous word embeddings encode any useful information for constituency parsing?
    Page 1, “Abstract”
  2. We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon.
    Page 1, “Abstract”
  3. Our results support an overall hypothesis that word embeddings import syntactic information that is ultimately redundant with distinctions learned from treebanks in other ways.
    Page 1, “Abstract”
  4. This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.
    Page 1, “Introduction”
  5. While word embeddings can be constructed directly from surface distributional statistics, as in LSA, more sophisticated tools for unsupervised extraction of word representations have recently gained popularity (Collobert et al., 2011; Mikolov et al., 2013a).
    Page 1, “Introduction”
  6. (Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features.
    Page 1, “Introduction”
  7. In the other direction, access to a syntactic parse has been shown to be useful for constructing word embeddings for phrases compositionally (Hermann and Blunsom, 2013; Andreas and Ghahramani, 2013).
    Page 1, “Introduction”
  8. It has been less clear how (and indeed whether) word embeddings in and of themselves are useful for constituency parsing.
    Page 1, “Introduction”
  9. There certainly exist competitive parsers that internally represent lexical items as real-valued vectors, such as the neural network-based parser of Henderson (2004), and even parsers which use pre-trained word embeddings to represent the lexicon, such as Socher et al. (2013).
    Page 1, “Introduction”
  10. In order to isolate the contribution from word embeddings, it is useful to demonstrate improvement over a parser that already achieves state-of-the-art performance without vector representations.
    Page 1, “Introduction”
  11. With extremely limited training data, parser extensions using word embeddings give modest improvements in accuracy (relative error reduction on the order of 1.5%).
    Page 2, “Introduction”
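
To put the figure in sentence 11 in perspective, relative error reduction is measured on parsing error (100 - F1) rather than on F1 itself. The numbers below are a hypothetical illustration, not values from the paper's tables:

    \[ F_1^{\text{baseline}} = 80.0 \;\Rightarrow\; \text{error} = 100 - 80.0 = 20.0 \]
    \[ \text{error}_{\text{new}} = 20.0 \times (1 - 0.015) = 19.7 \;\Rightarrow\; F_1^{\text{new}} = 80.3 \]

At that operating point, a 1.5% relative error reduction is roughly 0.3 absolute F1, consistent with sentence 11's description of the improvements as modest.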

constituency parsing

Appears in 8 sentences as: constituency parser (2) constituency parsers (2) constituency parses (1) constituency parsing (3)
In How much do word embeddings encode about syntax?
  1. Do continuous word embeddings encode any useful information for constituency parsing?
    Page 1, “Abstract”
  2. This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.
    Page 1, “Introduction”
  3. It has been less clear how (and indeed whether) word embeddings in and of themselves are useful for constituency parsing.
    Page 1, “Introduction”
  4. The fact that word embedding features result in nontrivial gains for discriminative dependency parsing (Bansal et al., 2014), but do not appear to be effective for constituency parsing, points to an interesting structural difference between the two tasks.
    Page 2, “Introduction”
  5. We hypothesize that dependency parsers benefit from the introduction of features (like clusters and embeddings) that provide syntactic abstractions; but that constituency parsers already have access to such abstractions in the form of supervised preterminal tags.
    Page 2, “Introduction”
  6. We are interested in the question of whether a state-of-the-art discrete-variable constituency parser can be improved with word embeddings, and, more precisely, what aspect (or aspects) of the parser can be altered to make effective use of embeddings.
    Page 2, “Three possible benefits of word embeddings”
  7. It is important to emphasize that these results do not argue against the use of continuous representations in a parser’s state space, nor argue more generally that constituency parsers cannot possibly benefit from word embeddings.
    Page 5, “Conclusion”
  8. Indeed, our results suggest a hypothesis that word embeddings are useful for dependency parsing (and perhaps other tasks) because they provide a level of syntactic abstraction which is explicitly annotated in constituency parses.
    Page 5, “Conclusion”

Berkeley parser

Appears in 7 sentences as: Berkeley parser (7)
In How much do word embeddings encode about syntax?
  1. These are precisely the kinds of distinctions between determiners that state-splitting in the Berkeley parser has shown to be useful (Petrov and Klein, 2007), and existing work (Mikolov et al., 2013b) has observed that such regular embedding structure extends to many other parts of speech.
    Page 2, “Three possible benefits of word embeddings”
  2. For the experiments in this paper, we will use the Berkeley parser (Petrov and Klein, 2007) and the related Maryland parser (Huang and Harper, 2011).
    Page 3, “Parser extensions”
  3. The Berkeley parser induces a latent, state-split PCFG in which each symbol V of the (observed) X-bar grammar is refined into a set of more specific symbols {V1, V2, ...}.
    Page 3, “Parser extensions”
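    (A schematic illustration of this state-splitting appears after this list.)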
  4. First, these parsers are among the best in the literature, with a test performance of 90.7 F1 for the baseline Berkeley parser on the Wall Street Journal corpus (compared to 90.4 for Socher et al., 2013).
    Page 3, “Parser extensions”
  5. This ensures that our model continues to include the original Berkeley parser model as a limiting case.
    Page 4, “Parser extensions”
  6. We use the Maryland implementation of the Berkeley parser as our baseline for the kernel-smoothed lexicon, and the Maryland featured parser as our baseline for the embedding-featured lexicon. For all experiments, we use 50-dimensional word embeddings.
    Page 4, “Experimental setup”
  7. For each training corpus size we also choose a different setting of the number of splitting iterations over which the Berkeley parser is run; for 300 sentences this is two splits, and for
    Page 4, “Experimental setup”
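
Sentence 3 above summarizes the latent-annotation scheme behind the Berkeley parser. As a schematic illustration (the symbols and the number of splits below are generic, not rules taken from the paper or from a learned grammar), a single observed X-bar rule is replaced by a family of weighted rules over refined symbols:

    \[ \text{NP} \rightarrow \text{DT}\ \text{NN} \quad\Longrightarrow\quad \text{NP}_i \rightarrow \text{DT}_j\ \text{NN}_k, \qquad i, j, k \in \{1, \dots, m\} \]

Each refined rule carries its own probability estimated during training, which is how, for instance, different DT subsymbols can come to specialize in different determiner classes, the kind of distinction sentence 1 in this list refers to.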

dependency parsers

Appears in 5 sentences as: Dependency parsers (1) dependency parsers (2) dependency parsing (2)
In How much do word embeddings encode about syntax?
  1. Dependency parsers have seen gains from distributional statistics in the form of discrete word clusters (Koo et al., 2008), and recent work (Bansal et al., 2014) suggests that similar gains can be derived from embeddings like the ones used in this paper.
    Page 1, “Introduction”
  2. The fact that word embedding features result in nontrivial gains for discriminative dependency parsing (Bansal et al., 2014), but do not appear to be effective for constituency parsing, points to an interesting structural difference between the two tasks.
    Page 2, “Introduction”
  3. We hypothesize that dependency parsers benefit from the introduction of features (like clusters and embeddings) that provide syntactic abstractions; but that constituency parsers already have access to such abstractions in the form of supervised preterminal tags.
    Page 2, “Introduction”
  4. However, the failure to uncover gains when searching across a variety of possible mechanisms for improvement, training procedures for embeddings, hyperparameter settings, tasks, and resource scenarios suggests that these gains (if they do exist) are extremely sensitive to these training conditions, and not nearly as accessible as they seem to be in dependency parsers.
    Page 5, “Conclusion”
  5. Indeed, our results suggest a hypothesis that word embeddings are useful for dependency parsing (and perhaps other tasks) because they provide a level of syntactic abstraction which is explicitly annotated in constituency parses.
    Page 5, “Conclusion”

treebank

Appears in 5 sentences as: Treebank (1) treebank (4)
In How much do word embeddings encode about syntax?
  1. Example: the infrequently-occurring treebank tag UH dominates greetings (among other interjections).
    Page 2, “Three possible benefits of word embeddings”
  2. Example: individual first names are also rare in the treebank, but tend to cluster together in distributional representations.
    Page 2, “Three possible benefits of word embeddings”
  3. Experiments are conducted on the Wall Street Journal portion of the English Penn Treebank.
    Page 4, “Experimental setup”
  4. We prepare three training sets: the complete training set of 39,832 sentences from the treebank (sections 2 through 21), a smaller training set, consisting of the first 3000 sentences, and an even smaller set of the first 300.
    Page 4, “Experimental setup”
  5. test on the French treebank (the “French” column).
    Page 5, “Results”

vector representation

Appears in 3 sentences as: vector representation (2) vector representations (1)
In How much do word embeddings encode about syntax?
  1. In order to isolate the contribution from word embeddings, it is useful to demonstrate improvement over a parser that already achieves state-of-the-art performance without vector representations.
    Page 1, “Introduction”
  2. φ(w) is the vector representation of the word w, the α are per-basis weights, and β is an inverse radius parameter which determines the strength of the smoothing.
    Page 3, “Parser extensions”
  3. vector representation.
    Page 4, “Parser extensions”
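
The quantities named in sentence 2 above belong to the kernel-smoothed lexicon. The exact equation is not reproduced on this page, so the form below is a plausible reconstruction under the assumption of a Gaussian kernel over embedding space, written with the symbols the quoted sentence describes:

    \[ p_{\text{smooth}}(w \mid \tau) \;\propto\; \sum_{w'} \alpha_{w'} \, \exp\!\left( -\beta \, \lVert \phi(w) - \phi(w') \rVert^{2} \right) \]

Here φ(w) is the word's vector representation, the α_{w'} are the per-basis weights, and a larger inverse radius β concentrates the sum on the closest neighbors of w, i.e. weaker smoothing of the lexicon. As a quick numeric check of the kernel term, with β = 1 the weight on a basis word at embedding distance 1 is e^{-1} ≈ 0.37, falling to e^{-4} ≈ 0.02 at distance 2.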
