Vector space semantics with frequency-driven motifs
Srivastava, Shashank and Hovy, Eduard

Article Structure

Abstract

Traditional models of distributional semantics suffer from computational issues such as data sparsity for individual lexemes and complexities of modeling semantic composition when dealing with structures larger than single lexical items.

Introduction

Meaning in language is a confluence of experientially acquired semantics of words or multiword phrases, and their semantic composition to create new meanings.

Topics

segmentation model

Appears in 17 sentences as: segmentation model (14) segmentation models (1) segmentations model (2)
In Vector space semantics with frequency-driven motifs
  1. We design a segmentation model to optimally partition a sentence into lineal constituents, which can be used to define distributional contexts that are less noisy, semantically more interpretable, and linguistically disambiguated.
    Page 1, “Abstract”
  2. While existing work has focused on the classification task of categorizing a phrasal constituent as a MWE or a non-MWE, the general ideas of most of these works are in line with our current framework, and the feature-set for our motif segmentation model is designed to subsume most of these ideas.
    Page 4, “Introduction”
  3. 3.1 Linear segmentation model
    Page 4, “Introduction”
  4. The segmentation model forms the core of the framework.
    Page 4, “Introduction”
  5. The segmentation model is a chain LVM (latent variable model) that aims to maximize a linear objective defined by:
    Page 4, “Introduction”
  6. In this section, we describe the principal features used in the segmentation model. Transitional features and penalties:
    Page 5, “Introduction”
  7. Additionally, a few features for the segmentation model were minor orthographic features based on word shape (length and capitalization patterns).
    Page 6, “Introduction”
  8. With the segmentation model described in the previous section, we process text from the English Gigaword corpus and the Simple English Wikipedia to partition sentences into motifs.
    Page 6, “Introduction”
  9. Since the segmentation model accounts for the contexts of the entire sentence in determining motifs, different instances of the same token could evoke different meaning representations.
    Page 6, “Introduction”
  10. Consider the following sentences tagged by the segmentation model, which would correspond to different representations of the token ‘remains’: once as a standalone motif, and once as part of an encompassing bigram motif (‘remains classified’).
    Page 6, “Introduction”
  11. We first quantitatively and qualitatively analyze the performance of the segmentation model, and then evaluate the distributional motif representations learnt by the model through two downstream applications.
    Page 7, “Introduction”
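
The excerpts above describe the segmentation model as a chain latent-variable model that scores candidate partitions of a sentence with a linear objective over features. The sketch below illustrates that kind of linear scorer on a toy B/I tagging of motif boundaries; the tag scheme, feature templates, and weights are assumptions of this illustration, not the paper's actual feature set.

```python
# A minimal sketch (not the authors' implementation) of scoring one candidate
# segmentation under a linear model: a sentence is tagged with B (begins a motif)
# and I (continues the current motif), and the score is the dot product w . phi(x, y).
# The feature templates below are simplified stand-ins for the paper's
# frequency-based, transitional, and orthographic features.

from collections import defaultdict

def features(tokens, tags):
    """Count sparse features for a tagged sentence (hypothetical templates)."""
    phi = defaultdict(float)
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        phi[f"tag={tag}|shape={'cap' if tok[0].isupper() else 'low'}"] += 1.0
        if i > 0:
            phi[f"trans={tags[i-1]}->{tag}"] += 1.0              # transitional feature
            phi[f"bigram={tokens[i-1]}_{tok}|tag={tag}"] += 1.0  # frequency proxy
    return phi

def score(tokens, tags, w):
    """Linear objective: sum of weights of the fired features."""
    return sum(w.get(f, 0.0) * v for f, v in features(tokens, tags).items())

# Example: 'remains classified' grouped into one bigram motif (B I) vs. split (B B).
w = defaultdict(float, {"trans=B->I": 0.5, "bigram=remains_classified|tag=I": 1.2})
toks = ["the", "report", "remains", "classified"]
print(score(toks, ["B", "B", "B", "I"], w))  # bigram-motif reading scores higher
print(score(toks, ["B", "B", "B", "B"], w))  # all unigram motifs
```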


embeddings

Appears in 8 sentences as: embeddings (8)
In Vector space semantics with frequency-driven motifs
  1. Hellinger PCA embeddings learnt using the framework show competitive results on empirical tasks.
    Page 1, “Abstract”
  2. While word embeddings and language models from such methods have been useful for tasks such as relation classification, polarity detection, event coreference and parsing; much of existing literature on composition is based on abstract linguistic theory and conjecture, and there is little evidence to support that learnt representations for larger linguistic units correspond to their semantic meanings.
    Page 3, “Introduction”
  3. While this framework is attractive in the lack of assumptions on representation that it makes, the use of distributional embeddings for individual tokens means
    Page 3, “Introduction”
  4. Recent work (Lebret and Lebret, 2013) has shown that the Hellinger distance is an especially effective measure in learning distributional embeddings, with Hellinger PCA being much less computationally expensive than neural language modeling approaches, while performing much better than standard PCA, and competitive with the state-of-the-art in downstream evaluations.
    Page 7, “Introduction”
  5. For this task, the motif-based distributional embeddings vastly outperform a conventional distributional model (DSM) based on token distributions, as well as additive (AVM) and multiplicative (MVM) models of vector compositionality, as
    Page 8, “Introduction”
  6. The model is competitive with the state-of-the-art VTK (Srivastava et al., 2013) that uses the SENNA neural embeddings by Collobert et al.
    Page 9, “Introduction”
  7. Table 5 shows that the motif-based DSM does better than discriminative models such as CRFs and SVMs, and also slightly improves on the VTK kernel with distributional embeddings.
    Page 9, “Introduction”
  8. Finally, we obtain motif representations in the form of low-dimensional vector-space embeddings, and our experimental findings indicate the value of the learnt representations in downstream applications.
    Page 9, “Introduction”
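
The excerpts above attribute the learnt embeddings to Hellinger PCA over motif-context distributions, citing (Lebret and Lebret, 2013). The following is an illustrative reconstruction of that general recipe on a toy co-occurrence matrix, not the paper's code; the row normalisation and centring details are assumptions of this sketch.

```python
# A minimal sketch of Hellinger PCA over motif-context co-occurrence counts.
# Rows are motifs, columns are context motifs; each row is normalised to a
# probability distribution, square-rooted (the Hellinger map), and reduced
# with a truncated SVD.

import numpy as np

def hellinger_pca(counts, k=2):
    """counts: (n_motifs, n_contexts) co-occurrence matrix; returns k-dim embeddings."""
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    H = np.sqrt(probs)                        # Hellinger transform
    H = H - H.mean(axis=0, keepdims=True)     # centre before PCA
    U, S, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :k] * S[:k]                   # project onto top-k components

# Toy example: 4 motifs, 5 context columns.
counts = np.array([[5, 1, 0, 2, 0],
                   [4, 2, 0, 1, 0],
                   [0, 0, 6, 1, 3],
                   [0, 1, 5, 0, 4]], dtype=float)
emb = hellinger_pca(counts, k=2)
print(emb.shape)   # (4, 2): one low-dimensional vector per motif
```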


distributional semantics

Appears in 7 sentences as: Distributional semantic (1) distributional semantics (6)
In Vector space semantics with frequency-driven motifs
  1. Traditional models of distributional semantics suffer from computational issues such as data sparsity for individual lexemes and complexities of modeling semantic composition when dealing with structures larger than single lexical items.
    Page 1, “Abstract”
  2. In this work, we present a frequency-driven paradigm for robust distributional semantics in terms of semantically cohesive lineal constituents, or motifs.
    Page 1, “Abstract”
  3. In particular, such a perspective can be especially advantageous for distributional semantics for reasons we outline below.
    Page 1, “Introduction”
  4. Distributional semantic models (DSMs) that represent words as distributions over neighbouring contexts have been particularly effective in capturing fine-grained lexical semantics (Turney et al., 2010).
    Page 1, “Introduction”
  5. In this section, we define our frequency-driven framework for distributional semantics in detail.
    Page 4, “Introduction”
  6. A method towards frequency-driven distributional semantics could involve the following principal components:
    Page 4, “Introduction”
  7. We have presented a new frequency-driven framework for distributional semantics of not only lexical items but also longer cohesive motifs.
    Page 9, “Introduction”


semi-supervised

Appears in 7 sentences as: Semi-supervised (2) semi-supervised (7)
In Vector space semantics with frequency-driven motifs
  1. This is necessary for the scenario of semi-supervised learning of weights with partially annotated sentences, as described later.
    Page 5, “Introduction”
  2. Semi-supervised learning: In the semi-supervised case, the labels y_i^(k) are known only for some of the tokens in x^(k).
    Page 5, “Introduction”
  3. The semi-supervised approach enables incorporation of significantly more training data.
    Page 5, “Introduction”
  4. This would involve initializing the weights prior to the semi-supervised procedure with the weights from the supervised learning model, so as to seed the semi-supervised approach with a reasonable model, and use the partially annotated data to fine-tune the supervised model.
    Page 5, “Introduction”
  5. (Table 2 excerpt; columns P, R, F) Semi-supervised: 0.30, 0.17, 0.22; Supervised + annealing: …
    Page 7, “Introduction”
  6. The supervised model expectedly outperforms both the rule-based and the semi-supervised systems.
    Page 8, “Introduction”
  7. However, the supervised learning model with subsequent annealing outperforms the supervised model in terms of both precision and recall, showing the utility of the semi-supervised method when seeded with a good initial model, and the additive value of partially labeled data.
    Page 8, “Introduction”
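
The semi-supervised procedure quoted above treats partial annotations as constraints on decoding and fine-tunes an initial model with them. The sketch below mimics that hard-EM style loop on a toy B/I tagging task; it uses brute-force constrained decoding and a perceptron-style update purely for illustration, whereas the paper uses (constrained) Viterbi decoding, as in the Viterbi excerpts below.

```python
# A minimal, self-contained sketch of a hard-EM style procedure: partial labels
# constrain decoding (E-step), and the decoded sequences are then treated as
# observed for a perceptron-style weight update (M-step). Brute-force decoding
# over B/I tags is used here purely for illustration.

from itertools import product
from collections import defaultdict

def phi(tokens, tags):
    f = defaultdict(float)
    for i, tag in enumerate(tags):
        f[f"emit={tokens[i]}|{tag}"] += 1.0
        if i > 0:
            f[f"trans={tags[i-1]}->{tag}"] += 1.0
    return f

def score(tokens, tags, w):
    return sum(w.get(k, 0.0) * v for k, v in phi(tokens, tags).items())

def constrained_decode(tokens, partial, w):
    """Best B/I sequence agreeing with the observed labels (None = unlabeled)."""
    best, best_s = None, float("-inf")
    for tags in product("BI", repeat=len(tokens)):
        if any(p is not None and p != t for p, t in zip(partial, tags)):
            continue
        s = score(tokens, tags, w)
        if s > best_s:
            best, best_s = tags, s
    return best

def hard_em(data, w, epochs=3, lr=1.0):
    for _ in range(epochs):
        for tokens, partial in data:
            target = constrained_decode(tokens, partial, w)           # E-step
            pred = constrained_decode(tokens, [None] * len(tokens), w)
            for k, v in phi(tokens, target).items():                  # M-step
                w[k] += lr * v
            for k, v in phi(tokens, pred).items():
                w[k] -= lr * v
    return w

# One partially annotated sentence: only 'classified' is known to continue a motif.
data = [(["remains", "classified"], [None, "I"])]
w = hard_em(data, defaultdict(float))
print(dict(w))
```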


Viterbi

Appears in 7 sentences as: Viterbi (7)
In Vector space semantics with frequency-driven motifs
  1. in linear time (in the number of tokens) following the generalized Viterbi algorithm.
    Page 5, “Introduction”
  2. A slightly modified version of Viterbi could also be used to find segmentations that are constrained to agree with some given motif boundaries, but can segment other parts of the sentence optimally under these constraints.
    Page 5, “Introduction”
  3. Here y’ = Decode(x^(k), w) is the optimal Viterbi decoding using the current estimates of the weights.
    Page 5, “Introduction”
  4. Implicitly, the weight learning algorithm can be seen as a gradient descent procedure minimizing the difference between the scores of highest scoring (Viterbi) state sequences, and the label state sequences.
    Page 5, “Introduction”
  5. While the Viterbi algorithm can be used for tagging optimal state-sequences given the weights, the structured perceptron can learn optimal model weights given gold-standard sequence labels.
    Page 5, “Introduction”
  6. The algorithm proceeds as follows: in the E-step, we use the current values of weights to compute hard-expectations, i.e., the best scoring Viterbi sequences among those consistent with the observed state labels.
    Page 5, “Introduction”
  7. 5: Decode D with current w to find optimal Viterbi paths that agree with (partial) ground truths.
    Page 5, “Introduction”
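
The excerpts above rely on Viterbi decoding of the optimal state sequence in time linear in the number of tokens. Below is a minimal max-product Viterbi over two illustrative segmentation states (B = begins a motif, I = continues one); the emission and transition scores are made up for the example and stand in for the learned feature weights.

```python
# A minimal sketch of Viterbi decoding for a chain segmentation model with two
# states (0 = B, 1 = I). The recursion is the standard max-product dynamic
# program, linear in the number of tokens.

import numpy as np

def viterbi(emit, trans):
    """emit: (n_tokens, n_states) scores; trans: (n_states, n_states) scores."""
    n, k = emit.shape
    delta = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    delta[0] = emit[0]
    for t in range(1, n):
        cand = delta[t - 1][:, None] + trans + emit[t][None, :]   # (prev, cur)
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                      # 0 = B, 1 = I

# 'the report remains classified': a B->I bonus lets the last two tokens form one motif.
emit = np.array([[1.0, -2.0], [1.0, -2.0], [1.0, 0.2], [0.1, 1.5]])
trans = np.array([[0.0, 0.3], [0.2, -0.5]])
print(viterbi(emit, trans))   # [0, 0, 0, 1]: last two tokens grouped as a bigram motif
```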


segmentations

Appears in 6 sentences as: segmentations (6)
In Vector space semantics with frequency-driven motifs
  1. The model accounts for possible segmentations of a sentence into potential motifs, and prefers recurrent and cohesive motifs through features that capture frequency-based and statistical
    Page 4, “Introduction”
  2. A slightly modified version of Viterbi could also be used to find segmentations that are constrained to agree with some given motif boundaries, but can segment other parts of the sentence optimally under these constraints.
    Page 5, “Introduction”
  3. Additionally, a few features for the segmentation model were minor orthographic features based on word shape (length and capitalization patterns).
    Page 6, “Introduction”
  4. In an evaluation of the motif segmentation model within the perspective of our framework, we believe that exact correspondence to human judgment is unrealistic, since guiding principles for defining motifs, such as semantic cohesion, are hard to define and only serve as working principles.
    Page 7, “Introduction”
  5. Table 2: Results for motif segmentations
    Page 7, “Introduction”
  6. For a baseline, we consider a rule-based model that simply learns all ngram segmentations seen in the training data, and marks any occurrence of a matching token sequence as a motif, without taking neighbouring context into account.
    Page 7, “Introduction”


distributional representations

Appears in 5 sentences as: distributional representation (1) Distributional representations (1) distributional representations (3)
In Vector space semantics with frequency-driven motifs
  1. Notable among the most effective distributional representations are the recent deep-learning approaches by Socher et al.
    Page 3, “Introduction”
  2. With such a working definition, contiguous motifs are likely to make distributional representations less noisy and also assist in disambiguating context.
    Page 4, “Introduction”
  3. Also, the lack of specificity ensures that such motifs are common enough to meaningfully influence distributional representation beyond single tokens.
    Page 4, “Introduction”
  4. 4.2 Distributional representations
    Page 8, “Introduction”
  5. For evaluating distributional representations for motifs (in terms of other motifs) learnt by the framework, we test these representations in two downstream tasks: sentence polarity classification and metaphor detection.
    Page 8, “Introduction”


Tree Kernel

Appears in 5 sentences as: Tree Kernel (2) Tree kernels (1) tree kernels (2)
In Vector space semantics with frequency-driven motifs
  1. 2.2 Tree kernels
    Page 3, “Introduction”
  2. Tree Kernel methods have gained popularity in the last decade for capturing syntactic information in the structure of parse trees (Collins and Duffy, 2002; Moschitti, 2006).
    Page 3, “Introduction”
  3. (2013) have attempted to provide formulations to incorporate semantics into tree kernels through the use of distributional word vectors at the individual word-nodes.
    Page 3, “Introduction”
  4. Specifically, the ‘bag of words’ assumption in tree kernels doesn’t suffice for these lexemes, and a stronger semantic model is needed to capture phrasal semantics as well as diverging inter-word relations such as in ‘coffee table’ and ‘water table’.
    Page 3, “Introduction”
  5. For composing the motif representations to get judgments on semantic similarity of sentences, we use our recent Vector Tree Kernel approach. The VTK approach defines a convolutional kernel over graphs defined by the dependency parses of sentences, using a vector representation at each graph node that represents a single lexical token.
    Page 8, “Introduction”
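
The last excerpt describes VTK as a convolutional kernel over dependency graphs with a distributional vector at each node. The toy function below is only meant to convey that general idea of soft, vector-based node matching composed recursively over a tree; it is not the actual VTK formulation and is not guaranteed to be a valid kernel.

```python
# A toy illustration (not the VTK formulation) of comparing dependency trees
# whose nodes carry distributional vectors: node matches are soft (cosine)
# rather than exact-label matches as in classical tree kernels, and child
# subtrees contribute through a decayed, greedily matched recursion.

import numpy as np

class Node:
    def __init__(self, vec, children=()):
        self.vec = np.asarray(vec, dtype=float)
        self.children = list(children)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def tree_sim(a, b, decay=0.5):
    """Soft node match plus decayed comparison of best-matching subtrees."""
    s = cos(a.vec, b.vec)
    for ca in a.children:
        if b.children:
            s += decay * max(tree_sim(ca, cb, decay) for cb in b.children)
    return s

# Two tiny dependency trees with 2-d 'embeddings' at the nodes.
t1 = Node([1, 0], [Node([0.9, 0.1]), Node([0, 1])])
t2 = Node([1, 0.1], [Node([0.8, 0.2])])
print(round(tree_sim(t1, t2), 3))
```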


perceptron

Appears in 4 sentences as: Perceptron (1) perceptron (3)
In Vector space semantics with frequency-driven motifs
  1. In this case, learning can follow the online structured perceptron learning procedure by Collins (2002), where weight updates for the k’th training example (x^(k), y^(k)) are given as:
    Page 5, “Introduction”
  2. While the Viterbi algorithm can be used for tagging optimal state-sequences given the weights, the structured perceptron can learn optimal model weights given gold-standard sequence labels.
    Page 5, “Introduction”
  3. In the M-step, we take the decoded state-sequences in the E-step as observed, and run perceptron learning to update feature weights wi.
    Page 5, “Introduction”
  4. 6: Run Structured Perceptron algorithm with decoded tag-sequences to update weights w
    Page 5, “Introduction”
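
The update rule referenced in the first excerpt (Collins, 2002) moves the weights toward the features of the gold sequence and away from the features of the current best decoding. A minimal sketch follows, with phi standing for any sparse feature counter over a tagged sentence, such as the one sketched earlier on this page.

```python
# A minimal sketch of the structured perceptron update:
#   w <- w + lr * (phi(x, y_gold) - phi(x, y_pred))
# where y_pred is the Viterbi decoding under the current weights.

from collections import Counter

def perceptron_update(w, phi_gold, phi_pred, lr=1.0):
    """In-place sparse update of the weight dict w."""
    for f, v in phi_gold.items():
        w[f] = w.get(f, 0.0) + lr * v
    for f, v in phi_pred.items():
        w[f] = w.get(f, 0.0) - lr * v
    return w

# If the model predicted B-B where the gold segmentation was B-I:
w = {}
perceptron_update(w, Counter({"trans=B->I": 1}), Counter({"trans=B->B": 1}))
print(w)   # {'trans=B->I': 1.0, 'trans=B->B': -1.0}
```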


rule-based

Appears in 4 sentences as: Rule-based (1) rule-based (3)
In Vector space semantics with frequency-driven motifs
  1. (Table 2 excerpt; columns P, R, F) Rule-based baseline: 0.85, 0.10, 0.18; Supervised: 0.62, 0.28, 0.39
    Page 7, “Introduction”
  2. For a baseline, we consider a rule-based model that simply learns all ngram segmentations seen in the training data, and marks any occurrence of a matching token sequence as a motif, without taking neighbouring context into account.
    Page 7, “Introduction”
  3. However, the rule-based method has a very low recall due to lack of generalization capabilities.
    Page 7, “Introduction”
  4. The supervised model expectedly outperforms both the rule-based and the semi-supervised systems.
    Page 8, “Introduction”
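
The baseline described above memorises every motif n-gram seen in training and marks any later occurrence of a matching token sequence as a motif, ignoring the surrounding context. A minimal sketch follows; the longest-match-first tie-breaking and the maximum n-gram length are assumptions of this sketch, not details from the paper.

```python
# A minimal sketch of a rule-based segmentation baseline: collect motif n-grams
# from segmented training data, then greedily mark matching token sequences as
# motifs, longest match first, without looking at context.

def train_lexicon(segmented_sentences):
    """segmented_sentences: lists of motifs, each motif a tuple of tokens."""
    return {motif for sent in segmented_sentences for motif in sent if len(motif) > 1}

def segment(tokens, lexicon, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):   # longest match first
            if tuple(tokens[i:i + n]) in lexicon:
                out.append(tuple(tokens[i:i + n]))
                i += n
                break
        else:
            out.append((tokens[i],))   # no known n-gram starts here: unigram motif
            i += 1
    return out

lex = train_lexicon([[("remains", "classified"), ("the",), ("report",)]])
print(segment(["the", "report", "remains", "classified"], lex))
# [('the',), ('report',), ('remains', 'classified')]
```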


semantic similarity

Appears in 4 sentences as: semantic similarities (1) semantic similarity (4)
In Vector space semantics with frequency-driven motifs
  1. Instead of procuring explicit representations, the kernel paradigm directly focuses on the larger goal of quantifying semantic similarity of larger linguistic units.
    Page 3, “Introduction”
  2. Figure 1: Tokenwise syntactic and semantic similarities don’t imply sentential semantic similarity
    Page 3, “Introduction”
  3. With such neighbourhood contexts, the distributional paradigm posits that semantic similarity between a pair of motifs can be given by a sense of ‘distance’ between the two distributions.
    Page 6, “Introduction”
  4. For composing the motif representations to get judgments on semantic similarity of sentences, we use our recent Vector Tree Kernel approach. The VTK approach defines a convolutional kernel over graphs defined by the dependency parses of sentences, using a vector representation at each graph node that represents a single lexical token.
    Page 8, “Introduction”
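
The third excerpt leaves the notion of ‘distance’ between a pair of context distributions open; one concrete choice, consistent with the Hellinger-based representation learning cited elsewhere on this page, is the Hellinger distance sketched below.

```python
# A minimal sketch of the Hellinger distance between two context distributions:
#   H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
# It is 0 for identical distributions and 1 for distributions with disjoint support.

import numpy as np

def hellinger(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q)))

# Context counts of two motifs over the same four context columns.
print(hellinger([4, 1, 0, 1], [3, 2, 0, 1]))   # small distance: similar contexts
print(hellinger([4, 1, 0, 0], [0, 0, 3, 2]))   # 1.0: disjoint contexts
```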


data sparsity

Appears in 3 sentences as: data sparsity (3)
In Vector space semantics with frequency-driven motifs
  1. Traditional models of distributional semantics suffer from computational issues such as data sparsity for individual lexemes and complexities of modeling semantic composition when dealing with structures larger than single lexical items.
    Page 1, “Abstract”
  2. The framework subsumes issues such as differential compositional as well as non-compositional behavior of phrasal constituents, and circumvents some problems of data sparsity by design.
    Page 1, “Abstract”
  3. Such a framework for distributional models avoids the issue of data sparsity in learning of representations for larger linguistic structures.
    Page 9, “Introduction”


learning algorithm

Appears in 3 sentences as: learning algorithm (2) learning algorithms (1)
In Vector space semantics with frequency-driven motifs
  1. Implicitly, the weight learning algorithm can be seen as a gradient descent procedure minimizing the difference between the scores of highest scoring (Viterbi) state sequences, and the label state sequences.
    Page 5, “Introduction”
  2. Pseudocode of the learning algorithm for the partially labeled case is given in Algorithm 1.
    Page 5, “Introduction”
  3. We see that while all three learning algorithms perform better than the baseline, the performance of the purely unsupervised system is inferior to supervised approaches.
    Page 7, “Introduction”


vector representations

Appears in 3 sentences as: vector representation (1) vector representations (2)
In Vector space semantics with frequency-driven motifs
  1. Finally, we perform SVD on the motif similarity matrix (with size of the order of the total vocabulary in the corpus), and retain the first k principal eigenvectors to obtain low-dimensional vector representations that are more convenient to work with.
    Page 7, “Introduction”
  2. For composing the motif representations to get judgments on semantic similarity of sentences, we use our recent Vector Tree Kernel approach. The VTK approach defines a convolutional kernel over graphs defined by the dependency parses of sentences, using a vector representation at each graph node that represents a single lexical token.
    Page 8, “Introduction”
  3. For this task, we again use the VTK formalism for combining vector representations of the individual motifs.
    Page 9, “Introduction”
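
The first excerpt describes reducing the motif similarity matrix with an SVD and keeping the top components. The sketch below shows that reduction on a tiny symmetric similarity matrix; the square-root scaling of the singular values is a choice made for this illustration so that the low-dimensional vectors approximately reconstruct the matrix, and the toy matrix is far smaller than the vocabulary-sized one described in the paper.

```python
# A minimal sketch of truncated SVD over a motif-by-motif similarity matrix,
# keeping the top-k components to obtain compact vectors for each motif.

import numpy as np

def reduce_similarity_matrix(S, k=2):
    """Keep the top-k SVD components of a symmetric similarity matrix S."""
    U, sigma, _ = np.linalg.svd(S)
    return U[:, :k] * np.sqrt(sigma[:k])   # one k-dimensional vector per motif

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
vecs = reduce_similarity_matrix(S, k=2)
print(vecs.shape)                        # (3, 2)
print(np.round(vecs @ vecs.T, 2))        # rank-k approximation of S
```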
