Low-Rank Tensors for Scoring Dependency Structures
Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, Tommi Jaakkola

Article Structure

Abstract

Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations.

Introduction

Finding an expressive representation of input sentences is crucial for accurate parsing.

Related Work

Selecting Features for Dependency Parsing A great deal of parsing research has been dedicated to feature engineering (Lazaridou et al., 2013; Marton et al., 2010; Marton et al., 2011).

Problem Formulation

We will commence here by casting first-order dependency parsing as a tensor estimation problem.

Learning

The training set D = {(x_i, y_i)}_{i=1}^{N} consists of N pairs, where each pair consists of a sentence x_i and the corresponding gold (target) parse y_i.

Experimental Setup

Datasets We test our dependency model on 14 languages, including the English dataset from CoNLL 2008 shared tasks and all 13 datasets from CoNLL 2006 shared tasks (Buchholz and Marsi, 2006; Surdeanu et al., 2008).

Results

Overall Performance Table 2 shows the performance of our model and the baselines on 14 CoNLL datasets.

Conclusions

Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations.

Topics

feature vectors

Appears in 16 sentences as: feature vector (4) feature vectors (13)
In Low-Rank Tensors for Scoring Dependency Structures
  1. In this paper, we use tensors to map high-dimensional feature vectors into low dimensional representations.
    Page 1, “Abstract”
  2. The exploding dimensionality of rich feature vectors must then be balanced with the difficulty of effectively learning the associated parameters from limited training data.
    Page 1, “Introduction”
  3. We depart from this view and leverage high-dimensional feature vectors by mapping them into low dimensional representations.
    Page 1, “Introduction”
  4. We begin by representing high-dimensional feature vectors as multi-way cross-products of smaller feature vectors that represent words and their syntactic relations (arcs).
    Page 1, “Introduction”
  5. We can alternatively specify arc features in terms of rank-1 tensors by taking the Kronecker product of simpler feature vectors associated with the head (vector φ_h ∈ R^n), and modifier (vector φ_m ∈ R^n), as well as the arc itself (vector φ_{h,m} ∈ R^d).
    Page 4, “Problem Formulation”
  6. Here φ_{h,m} is much lower dimensional than the MST arc feature vector φ_{h→m} discussed earlier.
    Page 4, “Problem Formulation”
  7. By taking the cross-product of all these component feature vectors, we obtain the full feature representation for arc h → m as a rank-1 tensor
    Page 4, “Problem Formulation”
  8. Given the typical dimensions of the component feature vectors φ_h, φ_m, φ_{h,m}, it is not even possible to store all the parameters in A.
    Page 4, “Problem Formulation”
  9. Indeed, in the full English training set of CoNLL-2008, the tensor involves around 8 × 10^11 entries while the MST feature vector has approximately 1.5 × 10^7 features.
    Page 4, “Problem Formulation”
  10. For example, we can easily incorporate additional useful features in the feature vectors φ_h, φ_m and φ_{h,m}, since the low-rank assumption (for small enough r) effectively counters the otherwise uncontrolled feature expansion.
    Page 4, “Problem Formulation”
  11. Moreover, by controlling the amount of information we can extract from each of the component feature vectors (via rank r), the statistical estimation problem does not scale dramatically with the dimensions of φ_h, φ_m and φ_{h,m}.
    Page 4, “Problem Formulation”
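
The rank-1 feature tensor and its low-rank scoring described in the sentences above can be sketched in a few lines of NumPy. This is a toy illustration with invented dimensions: it checks that contracting the full tensor A with φ_h ⊗ φ_m ⊗ φ_{h,m} gives the same score as the cheap factored form Σ_i (Uφ_h)_i (Vφ_m)_i (Wφ_{h,m})_i, which is why A never needs to be materialized at full size:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 3, 2                 # toy dimensions; the real ones are far larger

phi_h = rng.normal(size=n)        # head word feature vector
phi_m = rng.normal(size=n)        # modifier word feature vector
phi_hm = rng.normal(size=d)       # arc feature vector

U = rng.normal(size=(r, n))
V = rng.normal(size=(r, n))
W = rng.normal(size=(r, d))

# Rank-r tensor A = sum_i U[i] (x) V[i] (x) W[i]: n*n*d entries, infeasible at scale.
A = np.einsum('in,im,id->nmd', U, V, W)

# Score by contracting A with the rank-1 feature tensor phi_h (x) phi_m (x) phi_hm.
slow = np.einsum('nmd,n,m,d->', A, phi_h, phi_m, phi_hm)

# Equivalent factored score: works only with the r-dimensional projections.
fast = np.sum((U @ phi_h) * (V @ phi_m) * (W @ phi_hm))

assert np.isclose(slow, fast)
```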

See all papers in Proc. ACL 2014 that mention feature vectors.

See all papers in Proc. ACL that mention feature vectors.


dependency parsing

Appears in 13 sentences as: Dependency Parsing (3) dependency parsing (10)
In Low-Rank Tensors for Scoring Dependency Structures
  1. Accurate scoring of syntactic structures such as head-modifier arcs in dependency parsing typically requires rich, high-dimensional feature representations.
    Page 1, “Abstract”
  2. Traditionally, parsing research has focused on modeling the direct connection between the features and the predicted syntactic relations such as head-modifier (arc) relations in dependency parsing .
    Page 1, “Introduction”
  3. We implement the low-rank factorization model in the context of first- and third-order dependency parsing.
    Page 2, “Introduction”
  4. Selecting Features for Dependency Parsing A great deal of parsing research has been dedicated to feature engineering (Lazaridou et al., 2013; Marton et al., 2010; Marton et al., 2011).
    Page 2, “Related Work”
  5. Embedding for Dependency Parsing A lot of recent work has been done on mapping words into vector spaces (Collobert and Weston, 2008; Turian et al., 2010; Dhillon et al., 2011; Mikolov et al., 2013).
    Page 2, “Related Work”
  6. We will commence here by casting first-order dependency parsing as a tensor estimation problem.
    Page 3, “Problem Formulation”
  7. We will start by introducing the notation used in the paper, followed by a more formal description of our dependency parsing task.
    Page 3, “Problem Formulation”
  8. 3.2 Dependency Parsing
    Page 3, “Problem Formulation”
  9. By learning parameters U, V, and W that function well in dependency parsing , we also learn context-dependent embeddings for words and arcs.
    Page 4, “Problem Formulation”
  10. We expect a dependency parsing model to benefit from several aspects of the low-rank tensor scoring.
    Page 4, “Problem Formulation”
  11. Methods We compare our model to MST and Turbo parsers on non-projective dependency parsing .
    Page 6, “Experimental Setup”
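
As framed in the sentences above, a first-order parser scores a tree as the sum of its head → modifier arc scores. The sketch below (scores invented) illustrates that objective with a greedy head-per-word choice; real non-projective parsers instead decode with the Chu-Liu/Edmonds maximum spanning tree algorithm, since greedy selection need not produce a tree:

```python
# Toy first-order scoring: score(tree) = sum of arc scores S[h][m].
S = {
    0: {1: 2.0, 2: 0.5, 3: 0.1},   # 0 is the artificial ROOT
    1: {2: 1.5, 3: 0.2},
    2: {1: 0.3, 3: 1.8},
    3: {1: 0.1, 2: 0.4},
}

def greedy_heads(S, n_words):
    # Pick the highest-scoring head for each modifier independently.
    # NOTE: illustration only -- this does not guarantee a well-formed tree.
    return {m: max((h for h in S if m in S[h]), key=lambda h: S[h][m])
            for m in range(1, n_words + 1)}

heads = greedy_heads(S, 3)
tree_score = sum(S[h][m] for m, h in heads.items())
print(heads, tree_score)           # {1: 0, 2: 1, 3: 2} 5.3
```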


POS tags

Appears in 10 sentences as: POS tag (2) POS tags (9)
In Low-Rank Tensors for Scoring Dependency Structures
  1. This low dimensional syntactic abstraction can be thought of as a proxy to manually constructed POS tags .
    Page 2, “Introduction”
  2. For instance, on the English dataset, the low-rank model trained without POS tags achieves 90.49% on first-order parsing, while the baseline gets 86.70% if trained under the same conditions, and 90.58% if trained with 12 core POS tags .
    Page 2, “Introduction”
  3. pos, form, lemma and morph stand for the fine POS tag , word form, word lemma and the morphology feature (provided in CoNLL format file) of the current word.
    Page 4, “Problem Formulation”
  4. For example, pos-p means the POS tag to the left of the current word in the sentence.
    Page 4, “Problem Formulation”
  5. Other possible features include, for example, the label of the arc h → m, the POS tags between the head and the modifier, boolean flags which indicate the occurrence of in-between punctuation or conjunctions, etc.
    Page 4, “Problem Formulation”
  6. These datasets include manually annotated dependency trees, POS tags and morphological information.
    Page 6, “Experimental Setup”
  7. In contrast, assume we take the cross-product of the auxiliary word vector values, POS tags and lexical items of a word and its context, and add the crossed values into a normal model (in φ_{h→m}).
    Page 7, “Experimental Setup”
  8. The rationale is that given all other features, the model would induce representations that play a similar role to POS tags .
    Page 8, “Results”
  9. Table 4: The first three columns show parsing results when models are trained without POS tags .
    Page 8, “Results”
  10. the performance of a parser trained with 12 Core POS tags .
    Page 8, “Results”


embeddings

Appears in 8 sentences as: embeddings (8)
In Low-Rank Tensors for Scoring Dependency Structures
  1. This is problematic when features lack clear linguistic meaning as in embeddings or when the information is blended across features.
    Page 1, “Abstract”
  2. First, features may lack clear linguistic interpretation as in distributional features or continuous vector embeddings of words.
    Page 1, “Introduction”
  3. Our low dimensional embeddings are tailored to the syntactic context of words (head, modifier).
    Page 2, “Introduction”
  4. Word-level vector space embeddings have so far had limited impact on parsing performance.
    Page 2, “Related Work”
  5. This framework enables us to learn new syntactically guided embeddings while also leveraging separately estimated word vectors as starting features, leading to improved parsing performance.
    Page 2, “Related Work”
  6. By learning parameters U, V, and W that function well in dependency parsing, we also learn context-dependent embeddings for words and arcs.
    Page 4, “Problem Formulation”
  7. For this purpose, we train a model with only a tensor component (such that it has to learn an accurate tensor) on the English dataset and obtain low dimensional embeddings Uφ_w and Vφ_w for each word.
    Page 8, “Results”
  8. The upper part shows our learned embeddings group words with similar syntactic behavior.
    Page 9, “Results”
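
One way to inspect learned embeddings such as Uφ_w, as in the last two sentences above, is a cosine nearest-neighbor lookup. The words and vectors below are invented purely to show the mechanics of the grouping:

```python
import numpy as np

# Hypothetical learned head-role embeddings (U @ phi_w) for a few words;
# the values are made up so that verbs and fruits form two clusters.
emb = {
    "gives":  np.array([0.90, 0.10, 0.00]),
    "sends":  np.array([0.85, 0.15, 0.05]),
    "apple":  np.array([0.00, 0.20, 0.95]),
    "orange": np.array([0.05, 0.25, 0.90]),
}

def nearest(word):
    # Return the other word whose embedding has highest cosine similarity.
    v = emb[word]
    def cos(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    return max((w for w in emb if w != word), key=lambda w: cos(emb[w], v))

print(nearest("gives"))    # sends -- words with similar syntactic behavior group together
```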


CoNLL

Appears in 7 sentences as: CoNLL (9)
In Low-Rank Tensors for Scoring Dependency Structures
  1. The model was evaluated on 14 languages, using dependency data from CoNLL 2008 and CoNLL 2006.
    Page 2, “Introduction”
  2. pos, form, lemma and morph stand for the fine POS tag, word form, word lemma and the morphology feature (provided in CoNLL format file) of the current word.
    Page 4, “Problem Formulation”
  3. Datasets We test our dependency model on 14 languages, including the English dataset from CoNLL 2008 shared tasks and all 13 datasets from CoNLL 2006 shared tasks (Buchholz and Marsi, 2006; Surdeanu et al., 2008).
    Page 6, “Experimental Setup”
  4. Overall Performance Table 2 shows the performance of our model and the baselines on 14 CoNLL datasets.
    Page 7, “Results”
  5. Figure 1 shows the average UAS on CoNLL test datasets after each training epoch.
    Page 7, “Results”
  6. Figure 1: Average UAS on CoNLL testsets after different epochs.
    Page 8, “Results”
  7. has the shortest sentence length in CoNLL 2006.
    Page 9, “Results”


vector representations

Appears in 7 sentences as: vector representation (2) vector representations (5)
In Low-Rank Tensors for Scoring Dependency Structures
  1. Even in the case of first-order parsers, this results in a high-dimensional vector representation of each arc.
    Page 1, “Introduction”
  2. participating in an arc, such as continuous vector representations of words.
    Page 1, “Introduction”
  3. Finally, we demonstrate that the model can successfully leverage word vector representations , in contrast to the baselines.
    Page 2, “Introduction”
  4. Traditionally, these vector representations have been derived primarily from co-occurrences of words within sentences, ignoring syntactic roles of the co-occurring words.
    Page 2, “Related Work”
  5. While this method learns to map word combinations into vectors, it builds on existing word-level vector representations .
    Page 2, “Related Work”
  6. Specifically, Uφ_h (for a given sentence, suppressed) is an r-dimensional vector representation of the word corresponding to h as a head word.
    Page 4, “Problem Formulation”
  7. To add auxiliary word vector representations, we use the publicly available word vectors (Cirik and Sensoy, 2013).
    Page 6, “Experimental Setup”


UAS

Appears in 6 sentences as: UAS (6)
In Low-Rank Tensors for Scoring Dependency Structures
  1. We also obtain the best published UAS results on 5 languages.1
    Page 1, “Abstract”
  2. As the evaluation measure, we use unlabeled attachment scores (UAS) excluding punctuation.
    Page 7, “Experimental Setup”
  3. Our model also achieves the best UAS on 5 languages.
    Page 7, “Results”
  4. Figure 1 shows the average UAS on CoNLL test datasets after each training epoch.
    Page 7, “Results”
  5. Figure 1: Average UAS on CoNLL testsets after different epochs.
    Page 8, “Results”
  6. Learning of the tensor is harder because the scoring function is not linear (nor convex) with respect to parameters U, V and W. However, the tensor scoring component achieves better generalization on the test data, resulting in better UAS than NT-1st after 8 training epochs.
    Page 8, “Results”
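
The evaluation measure quoted above, UAS excluding punctuation, can be computed with a short helper. The token/head encoding below (one gold and one predicted head index per token, 0 for ROOT) is an assumption for the sketch:

```python
import string

def uas(gold_heads, pred_heads, tokens):
    # Unlabeled attachment score: fraction of non-punctuation tokens
    # whose predicted head index matches the gold head index.
    pairs = [(g, p) for g, p, t in zip(gold_heads, pred_heads, tokens)
             if t not in string.punctuation]
    return sum(g == p for g, p in pairs) / len(pairs)

gold = [2, 0, 2, 2]                # head index per token (0 = ROOT)
pred = [2, 0, 4, 2]
toks = ["John", "saw", "Mary", "."]

print(uas(gold, pred, toks))       # 2 of 3 scored tokens correct; "." is excluded
```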


feature templates

Appears in 5 sentences as: feature templates (5)
In Low-Rank Tensors for Scoring Dependency Structures
  1. A predominant way to counter the high dimensionality of features is to manually design or select a meaningful set of feature templates , which are used to generate different types of features (McDonald et al., 2005a; Koo and Collins, 2010; Martins et al., 2013).
    Page 1, “Introduction”
  2. Table 1: Word feature templates used by our model.
    Page 4, “Problem Formulation”
  3. Features For the arc feature vector φ_{h→m}, we use the same set of feature templates as MST v0.5.1.
    Page 6, “Experimental Setup”
  4. For head/modifier vectors φ_h and φ_m, we show the complete set of feature templates used by our model in Table 1.
    Page 6, “Experimental Setup”
  5. Finally, we use a similar set of feature templates as Turbo v2.1 for 3rd order parsing.
    Page 6, “Experimental Setup”
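
A feature template, as used in the sentences above, is a recipe that combines token attributes (POS, form, lemma, ...) of the head and modifier into concrete string features. A minimal sketch, with template and attribute names invented for illustration:

```python
def expand(templates, tokens):
    # tokens: role -> attribute dict, e.g. {"h": {...}, "m": {...}}.
    # Each template is a list of (role, attribute) pairs joined into one feature.
    return ["&".join(f"{role}.{attr}={tokens[role][attr]}"
                     for role, attr in tmpl)
            for tmpl in templates]

tokens = {
    "h": {"form": "saw", "pos": "VBD", "lemma": "see"},
    "m": {"form": "John", "pos": "NNP", "lemma": "john"},
}
templates = [
    [("h", "pos"), ("m", "pos")],       # head POS conjoined with modifier POS
    [("h", "lemma"), ("m", "form")],    # head lemma conjoined with modifier form
]

print(expand(templates, tokens))
# ['h.pos=VBD&m.pos=NNP', 'h.lemma=see&m.form=John']
```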


part-of-speech

Appears in 4 sentences as: part-of-speech (4)
In Low-Rank Tensors for Scoring Dependency Structures
  1. Syntactic relations manifest themselves in a broad range of surface indicators, ranging from morphological to lexical, including positional and part-of-speech (POS) tagging features.
    Page 1, “Introduction”
  2. For instance, morphological properties are closely tied to part-of-speech tags, which in turn relate to positional features.
    Page 1, “Introduction”
  3. The power of the low-rank model becomes evident in the absence of any part-of-speech tags.
    Page 2, “Introduction”
  4. Syntactic Abstraction without POS Since our model learns a compressed representation of feature vectors, we are interested to measure its performance when part-of-speech tags are not provided (See Table 4).
    Page 8, “Results”


model parameters

Appears in 3 sentences as: model parameter (1) model parameters (2)
In Low-Rank Tensors for Scoring Dependency Structures
  1. We will directly learn a low-rank tensor A (because r is small) in this form as one of our model parameters .
    Page 3, “Problem Formulation”
  2. where θ ∈ R^L, U ∈ R^(r×n), V ∈ R^(r×n), and W ∈ R^(r×d) are the model parameters to be learned.
    Page 5, “Problem Formulation”
  3. We should note that since our model parameter A is represented and learned in the low-rank form, we only have to store and maintain the low-rank projections Uφ_h, Vφ_m and Wφ_{h,m} rather than explicitly calculate the feature tensor φ_h ⊗ φ_m ⊗ φ_{h,m}.
    Page 7, “Experimental Setup”
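
The storage argument above can be made concrete with a parameter count. The dimensions below are illustrative assumptions, chosen only so that n·n·d matches the ~8 × 10^11 tensor entries quoted earlier; the rank r is likewise arbitrary:

```python
# Parameter count: full tensor A vs. its low-rank factors U, V, W.
n, d, r = 10_000, 8_000, 50       # illustrative dimensions, not from the paper

full_tensor = n * n * d           # entries in A: cannot be stored explicitly
low_rank = r * (n + n + d)        # entries in U (r x n), V (r x n), W (r x d)

print(f"full: {full_tensor:,} vs low-rank: {low_rank:,}")
# full: 800,000,000,000 vs low-rank: 1,400,000
```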


parsing model

Appears in 3 sentences as: parsing model (3)
In Low-Rank Tensors for Scoring Dependency Structures
  1. We expect a dependency parsing model to benefit from several aspects of the low-rank tensor scoring.
    Page 4, “Problem Formulation”
  2. Combined Scoring Our parsing model aims to combine the strengths of both traditional features from the MST/Turbo parser as well as the new low-rank tensor features.
    Page 5, “Problem Formulation”
  3. For our parser, we train both a first-order parsing model (as described in Section 3 and 4) as well as a third-order model.
    Page 6, “Experimental Setup”


part-of-speech tags

Appears in 3 sentences as: part-of-speech tags (3)
In Low-Rank Tensors for Scoring Dependency Structures
  1. For instance, morphological properties are closely tied to part-of-speech tags , which in turn relate to positional features.
    Page 1, “Introduction”
  2. The power of the low-rank model becomes evident in the absence of any part-of-speech tags .
    Page 2, “Introduction”
  3. Syntactic Abstraction without POS Since our model learns a compressed representation of feature vectors, we are interested to measure its performance when part-of-speech tags are not provided (See Table 4).
    Page 8, “Results”


syntactic context

Appears in 3 sentences as: syntactic context (3)
In Low-Rank Tensors for Scoring Dependency Structures
  1. Our low dimensional embeddings are tailored to the syntactic context of words (head, modifier).
    Page 2, “Introduction”
  2. More interestingly, we can consider the impact of syntactic context on the derived projections.
    Page 8, “Results”
  3. The two bottom parts of the table demonstrate how the projections change depending on the syntactic context of the word.
    Page 9, “Results”


word-level

Appears in 3 sentences as: Word-level (1) word-level (2)
In Low-Rank Tensors for Scoring Dependency Structures
  1. Nevertheless, any such word-level representation can be used to offset inherent sparsity problems associated with full lexicalization (Cirik and Sensoy, 2013).
    Page 2, “Related Work”
  2. Word-level vector space embeddings have so far had limited impact on parsing performance.
    Page 2, “Related Work”
  3. While this method learns to map word combinations into vectors, it builds on existing word-level vector representations.
    Page 2, “Related Work”
