Index of papers in Proc. ACL that mention
  • embeddings
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word.
Experiments
In this section, we first present a qualitative analysis comparing the nearest neighbors of our model’s embeddings with those of others, showing our embeddings better capture the semantics of words, with the use of global context.
Experiments
For all experiments, our models use 50-dimensional embeddings.
Experiments
The nearest neighbors of “market” that C&W’s embeddings give are more constrained by the syntactic constraint that words in plural form are only close to other words in plural form, whereas our model captures that the singular and plural forms of a word are similar in meaning.
Global Context-Aware Neural Language Model
C_{s,d} = Σ_{w∈V} max(0, 1 - g(s, d) + g(s^w, d))    (1)
Collobert and Weston (2008) showed that this ranking approach can produce good word embeddings that are useful in several NLP tasks, and allows much faster training of the model compared to optimizing the log-likelihood of the next word.
Global Context-Aware Neural Language Model
where [x_1, x_2, ..., x_m] is the concatenation of the m word embeddings representing sequence s, f is an element-wise activation function such as tanh, a_1 ∈ R^{h×1} is the activation of the hidden layer with h hidden nodes, W_1 ∈ R^{h×(md)} and W_2 ∈ R^{1×h} are respectively the first and second layer weights of the neural network, and b_1, b_2 are the biases of each layer.
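As a rough illustration of the scorer and the ranking objective in equation (1), here is a minimal NumPy sketch; the function and parameter names are illustrative, and the global-context argument d of the paper's g(s, d) is omitted.

import numpy as np

def score(word_embs, W1, b1, W2, b2):
    # Concatenate the m word embeddings of the sequence, apply a tanh hidden
    # layer, then a linear output layer, as described above.
    x = np.concatenate(word_embs)       # shape (m*d,)
    a1 = np.tanh(W1 @ x + b1)           # hidden activation, shape (h,)
    return (W2 @ a1 + b2).item()        # scalar score

def ranking_loss(true_seq, corrupted_seqs, params):
    # Hinge ranking loss: the observed sequence should outscore each corrupted
    # sequence (last word replaced) by a margin of 1.
    g_true = score(true_seq, *params)
    return sum(max(0.0, 1.0 - g_true + score(s, *params)) for s in corrupted_seqs)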
Global Context-Aware Neural Language Model
For the score of the global context, we represent the document also as an ordered list of word embeddings, d = (d_1, d_2, ..., d_k).
Multi-Prototype Neural Language Model
We present a way to use our learned single-prototype embeddings to represent each context window, which can then be used by clustering to perform word sense discrimination (Schutze, 1998).
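To make the sense-discrimination step concrete, a hedged sketch follows: each occurrence of a target word is represented by the average of its context-window embeddings and the occurrences are clustered. The paper's own setup may differ in details (context weighting, clustering algorithm); plain k-means from scikit-learn and all names here are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def induce_senses(context_windows, emb, k=10):
    # context_windows: list of token lists around each occurrence of the word.
    # emb: dict token -> single-prototype embedding.  Assumes each window has
    # at least one word with a known embedding.  Returns one sense id per occurrence.
    vecs = np.array([np.mean([emb[w] for w in win if w in emb], axis=0)
                     for win in context_windows])
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)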
embeddings is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Liu, Shujie and Li, Mu and Zhou, Ming and Zong, Chengqing
Abstract
We propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings (compact vector representations for phrases), which can distinguish the phrases with different semantic meanings.
Introduction
The models using word embeddings as the direct inputs to DNN cannot make full use of the whole syntactic and semantic information of the phrasal translation rules.
Introduction
(2011) make the phrase embeddings capture the sentiment information.
Introduction
(2013a) enable the phrase embeddings to mainly capture the syntactic knowledge.
embeddings is mentioned in 34 sentences in this paper.
Topics mentioned in this paper:
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Abstract
By exploiting tag embeddings and tensor-based transformation, MMTNN has the ability to model complicated interactions between tags and context characters.
Conventional Neural Network
The character embeddings are then stacked into an embedding matrix M ∈ R^{d×|C|}.
Conventional Neural Network
We will analyze the effect of character embeddings in more detail in Section 4.
Conventional Neural Network
The character embeddings extracted by the Lookup Table layer are then concatenated into a single vector a ∈ R^{H_1}, where H_1 = w · d is the size of Layer 1.
Introduction
between tags and context characters by exploiting tag embeddings and tensor-based transformation.
Max-Margin Tensor Neural Network
Similar to character embeddings, given a fixed-sized tag set T, the tag embeddings for tags are stored in a tag embedding matrix L ∈ R^{d×|T|}, where d is the dimensionality
Max-Margin Tensor Neural Network
of the vector space (the same as the character embeddings).
Max-Margin Tensor Neural Network
The tag embeddings start from a random initialization and can be automatically trained by back-propagation.
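A hedged sketch of the tag-embedding lookup just described; the tag set and the initialisation range are illustrative assumptions, not the paper's values.

import numpy as np

d = 50                                  # tag embedding dimensionality (same as characters)
tags = ["B", "M", "E", "S"]             # an illustrative tag set T
rng = np.random.default_rng(0)
L = rng.uniform(-0.01, 0.01, size=(d, len(tags)))   # randomly initialised tag embedding matrix

def tag_embedding(tag):
    # Column lookup in L; during training, gradients flow back into L, so the
    # tag embeddings are learned by back-propagation.
    return L[:, tags.index(tag)]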
embeddings is mentioned in 33 sentences in this paper.
Topics mentioned in this paper:
Nguyen, Thien Huu and Grishman, Ralph
Abstract
This paper evaluates word embeddings and clustering on adapting feature-based relation extraction systems.
Abstract
We systematically explore various ways to apply word embeddings and show the best adaptation improvement by combining word cluster and word embedding information.
Introduction
valued features of words (such as word embeddings (Mnih and Hinton, 2007; Collobert and Weston, 2008)) effectively.
Introduction
ing word embeddings (Bengio et al., 2001; Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Turian et al., 2010) on feature-based methods to adapt RE systems to new domains.
Introduction
More importantly, we show empirically that word embeddings and word clusters capture different information and their combination would further improve the adaptability of relation extractors.
Related Work
Although word embeddings have been successfully employed in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Maas and Ng, 2010), the application of word embeddings in RE is very recent.
Related Work
(2010) propose an abstraction-augmented string kernel for bio-relation extraction via word embeddings.
Related Work
(2012) and Khashabi (2013) use pre-trained word embeddings as input for Matrix-Vector Recursive Neural Networks (MV-RNN) to learn compositional structures for RE.
embeddings is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Blunsom, Phil
Abstract
We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings.
Abstract
Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences.
Experiments
We also investigate the learned embeddings from a qualitative perspective in §5.4.
Experiments
All our embeddings have dimensionality d=128, with the margin set to m=d. Further, we use L2 regularization with λ=1 and step-size in {0.01, 0.05}.
Experiments
This task involves learning language independent embeddings which are then used for document classification across the English-German language pair.
Introduction
Such word embeddings are naturally richer representations than those of symbolic or discrete models, and have been shown to be able to capture both syntactic and semantic information.
Introduction
In this work, we extend this hypothesis to multilingual data and joint-space embeddings.
Overview
We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
embeddings is mentioned in 28 sentences in this paper.
Topics mentioned in this paper:
Fu, Ruiji and Guo, Jiang and Qin, Bing and Che, Wanxiang and Wang, Haifeng and Liu, Ting
Abstract
This paper proposes a novel and effective method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words.
Background
In this paper, we aim to identify hypernym-hyponym relations using word embeddings, which have been shown to preserve good properties for capturing the semantic relationship between words.
Introduction
This paper proposes a novel approach for semantic hierarchy construction based on word embeddings.
Introduction
Word embeddings, also known as distributed word representations, typically represent words with dense, low-dimensional and real-valued vectors.
Introduction
Word embeddings have been empirically shown to preserve linguistic regularities, such as the semantic relationship between words (Mikolov et al., 2013b).
Method
Various models for learning word embeddings have been proposed, including neural net language models (Bengio et al., 2003; Mnih and Hinton, 2008; Mikolov et al., 2013b) and spectral models (Dhillon et al., 2011).
Method
(2013a) propose two log-linear models, namely the Skip-gram and CBOW model, to efficiently induce word embeddings .
Method
Therefore, we employ the Skip- gram model for estimating word embeddings in this study.
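For illustration only, a Skip-gram model can be estimated with the gensim toolkit roughly as below; the tooling, toy corpus and hyperparameters are assumptions, not the paper's setup.

from gensim.models import Word2Vec

# Toy tokenised corpus; in practice this would be the large raw corpus used for training.
sentences = [["semantic", "hierarchy", "construction"],
             ["word", "embeddings", "capture", "semantic", "relations"]]
model = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)  # sg=1 selects Skip-gram
vector = model.wv["semantic"]    # the induced word embedding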
embeddings is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Andreas, Jacob and Klein, Dan
Abstract
Do continuous word embeddings encode any useful information for constituency parsing?
Abstract
We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon.
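The first strategy above (connecting out-of-vocabulary words to known ones) can be pictured with the following hedged sketch; the function and argument names are hypothetical and cosine similarity is an assumption about the similarity measure.

import numpy as np

def nearest_in_vocabulary(word, emb, vocab):
    # Replace an out-of-vocabulary word with its closest known word in the
    # embedding space, so the parser can treat it as that word.
    if word in vocab or word not in emb:
        return word
    v = emb[word]
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max((w for w in vocab if w in emb), key=lambda w: cosine(v, emb[w]))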
Abstract
Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data.
Introduction
This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.
Introduction
While word embeddings can be constructed directly from surface distributional statistics, as in LSA, more sophisticated tools for unsupervised extraction of word representations have recently gained popularity (Collobert et al., 2011; Mikolov et al., 2013a).
Introduction
(Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features.
embeddings is mentioned in 38 sentences in this paper.
Topics mentioned in this paper:
Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua
Abstract
We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking.
Distributed representations
Distributed word representations are called word embeddings.
Distributed representations
Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
Distributed representations
4.1 Collobert and Weston (2008) embeddings
Introduction
word embeddings using unsupervised approaches.
Introduction
(2009) about Collobert and Weston (2008) embeddings, given training improvements that we describe in Section 7.1.
embeddings is mentioned in 51 sentences in this paper.
Topics mentioned in this paper:
Labutov, Igor and Lipson, Hod
Abstract
Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data.
Abstract
However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train.
Introduction
Moreover, we may already have on our hands embeddings for X and Y obtained from yet another (possibly unsupervised) task (C), in which X and Y are, for example, orthogonal.
Introduction
If the embeddings for task C happen to be learned from a much larger dataset, it would make sense to reuse task C embeddings, but adapt them for task A and/or task B.
Introduction
We will refer to task C and its embeddings as the source task and the source embeddings, and task A/B and its embeddings as the target task and the target embeddings.
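To make the source/target distinction concrete, here is a minimal sketch of one way to adapt source embeddings to a target task by regularising toward them; the objective, step rule and names are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def adapt_embeddings_step(phi_target, phi_source, task_gradient, lr=0.1, lam=1.0):
    # One gradient step on a hedged adaptation objective: fit the target task
    # while keeping the target embeddings close to the source embeddings
    # (squared-distance regulariser; lam trades off reuse vs. adaptation).
    grad = task_gradient(phi_target) + 2.0 * lam * (phi_target - phi_source)
    return phi_target - lr * grad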
embeddings is mentioned in 38 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Liu, Shujie and Li, Mu and Zhou, Ming and Yu, Nenghai
DNN for word alignment
Words are converted to embeddings using the lookup table LT, and the concatenation of the embeddings is fed to a classic neural network with two hidden layers; the output of the network is our lexical translation score:
DNN structures for NLP
Word embeddings often implicitly encode syntactic or semantic knowledge of the words.
DNN structures for NLP
Assuming a finite-sized vocabulary V, word embeddings form an (L × |V|)-dimensional embedding matrix WV, where L is a predetermined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix WV.
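A minimal sketch of that lookup, with a toy vocabulary (names and sizes are illustrative only).

import numpy as np

L = 50                                       # predetermined embedding length
vocab = {"the": 0, "house": 1, "is": 2}      # toy vocabulary V
WV = np.random.randn(L, len(vocab))          # (L x |V|) embedding matrix

def lookup(words):
    # Mapping words to embeddings is a column lookup in WV.
    return WV[:, [vocab[w] for w in words]]  # shape (L, number of words)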
DNN structures for NLP
After words have been transformed to their embeddings, they can be fed into subsequent classical network layers to model highly nonlinear relations:
Introduction
Based on the above analysis, in this paper, both the source-side and target-side words are first mapped to vectors via discriminatively trained word embeddings, and word pairs are scored by a multilayer neural network which takes rich contexts (surrounding words on both source and target sides) into consideration; an HMM-like distortion model is then applied on top of the neural network to characterize the structural aspects of bilingual sentences.
Related Work
(Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
Training
Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W^l, b^l for linear transformations in the hidden layers of the neural network, and distortion parameters s_d of jump distance d.
Training
Most parameters reside in the word embeddings.
Training
To get a good initial value, the usual approach is to pre-train the embeddings on a large monolingual corpus.
embeddings is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
Parikh, Ankur P. and Cohen, Shay B. and Xing, Eric P.
Abstract
The word embeddings are used during the learning process, but the final decoder that the learning algorithm outputs maps a POS tag sequence x to a parse tree.
Abstract
Then, latent states are generated for each bracket, and finally, the latent states at the yield of the bracketing parse tree generate the words of the sentence (in the form of embeddings).
Abstract
Let V := {w_1, ..., w_ℓ, z_1, ..., z_H}, with w_i representing the word embeddings, and z_i representing the latent states of the bracketings.
embeddings is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro
Introduction
Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the embeddings of words.
RNN-based Alignment Model
In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as in the FFNN-based model.
Related Work
Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003).
Training
The constraint concretely enforces agreement in word embeddings of both directions.
Training
The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
Training
where θ_FE (or θ_EF) denotes the weights of layers in a source-to-target (or target-to-source) alignment model, θ_L denotes the weights of a lookup layer, i.e., word embeddings, and α is a parameter that controls the strength of the agreement constraint.
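The agreement constraint can be pictured with a sketch like the following; the names and the exact form of the penalty (a squared Frobenius norm here) are assumptions, not the paper's definition.

import numpy as np

def joint_objective(loss_fe, loss_ef, L_fe, L_ef, alpha):
    # Two directional alignment losses plus an agreement penalty on the
    # difference between the two lookup layers (the two sets of word embeddings).
    return loss_fe + loss_ef + alpha * np.sum((L_fe - L_ef) ** 2)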
embeddings is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Das, Dipanjan and Weston, Jason and Ganchev, Kuzman
Abstract
We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings.
Experiments
The second baseline tries to decouple the WSABIE training from the embedding input, and trains a log-linear model using the embeddings.
Experiments
Hyperparameters For our frame identification model with embeddings, we search for the WSABIE hyperparameters using the development data.
Frame Identification with Embeddings
First, we extract the words in the syntactic context of runs; next, we concatenate their word embeddings as described in §2.2 to create an initial vector space representation.
Frame Identification with Embeddings
Formally, let x represent the actual sentence with a marked predicate, along with the associated syntactic parse tree; let our initial representation of the predicate context be g(x). Suppose that the word embeddings we start with are of dimension n. Then g is a function from a parsed sentence x to R^{nk}, where k is the number of possible syntactic context types.
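A hedged sketch of one way to realise such a g(x): one n-dimensional block per syntactic context type, filled with the embedding of the word found in that slot (zeros if the slot is empty), concatenated into a vector in R^{nk}. The data structures and names here are assumptions for illustration.

import numpy as np

def g(context_words, emb, context_types, n):
    # context_words: dict mapping a syntactic context type to the observed word (if any).
    # emb: dict word -> n-dimensional embedding.  context_types: the k possible types.
    blocks = [emb.get(context_words.get(t), np.zeros(n)) for t in context_types]
    return np.concatenate(blocks)    # shape (n * k,)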
Frame Identification with Embeddings
So for example “He runs the company” could help the model disambiguate “He owns the company.” Moreover, since g(x) relies on word embeddings rather than word identities, information is shared between words.
Overview
We present a model that takes word embeddings as input and learns to identify semantic frames.
Overview
We use word embeddings to represent the syntactic context of a particular predicate instance as a vector.
embeddings is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Blunsom, Phil
Abstract
We use this model to learn high dimensional embeddings for sentences and evaluate them in a range of tasks, demonstrating that the incorporation of syntax allows a concise model to learn representations that are both effective and general.
Experiments
We conclude with some qualitative analysis to get a better idea of whether the combination of CCG and RAE can learn semantically expressive embeddings.
Experiments
We use word-vectors of size 50, initialized using the embeddings provided by Turian et al.
Experiments
Experiment 1: Semi-Supervised Training In the first experiment, we use the semi-supervised training strategy described previously and initialize our models with the embeddings provided by Turian et al.
embeddings is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Tang, Duyu and Wei, Furu and Yang, Nan and Zhou, Ming and Liu, Ting and Qin, Bing
Related Work
The embeddings of C&W (Collobert et al., 2011), word2vec, WVSA (Maas et al., 2011) and our models are trained with the same dataset and the same parameter setting.
Related Work
ReEmb(C&W) and ReEmb(w2v) stand for the use of embeddings learned from 10 million distant-supervised tweets with C&W and word2vec, respectively.
Related Work
Table 3: Macro-F1 on positive/negative classification of tweets with different word embeddings.
embeddings is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Lei, Tao and Xin, Yu and Zhang, Yuan and Barzilay, Regina and Jaakkola, Tommi
Abstract
This is problematic when features lack clear linguistic meaning as in embeddings or when the information is blended across features.
Introduction
First, features may lack clear linguistic interpretation as in distributional features or continuous vector embeddings of words.
Introduction
Our low-dimensional embeddings are tailored to the syntactic context of words (head, modifier).
Problem Formulation
By learning parameters U, V, and W that function well in dependency parsing, we also learn context-dependent embeddings for words and arcs.
Related Work
Word-level vector space embeddings have so far had limited impact on parsing performance.
Related Work
This framework enables us to learn new syntactically guided embeddings while also leveraging separately estimated word vectors as starting features, leading to improved parsing performance.
Results
For this purpose, we train a model with only a tensor component (such that it has to learn an accurate tensor) on the English dataset and obtain low-dimensional embeddings Uφ_w and Vφ_w for each word.
Results
The upper part shows our learned embeddings group words with similar syntactic behavior.
embeddings is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Experiments and Results
As we mentioned in Section 5, constructing phrase pair embeddings from word embeddings may not be suitable.
Experiments and Results
We first train the source and target word embeddings separately using large monolingual data, following (Collobert et al., 2011).
Experiments and Results
E_wms(s_i) and E_wmt(t_j) are the monolingual word embeddings, and E_wbs(s_i) and E_wbt(t_j) are the bilingual word embeddings.
Model Training
Back-propagation is performed along the tree structure, and the phrase pair embeddings of the leaf nodes are updated.
Phrase Pair Embedding
A simple approach to construct a phrase pair embedding is to use the average of the embeddings of the words in the phrase pair.
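One reading of that baseline, as a hedged sketch: average the word embeddings on each side of the phrase pair and concatenate the two averages (the paper may combine the sides differently; names are illustrative).

import numpy as np

def average_phrase_pair_embedding(src_phrase, tgt_phrase, emb_src, emb_tgt):
    # Average the source-side and target-side word embeddings, then concatenate.
    src = np.mean([emb_src[w] for w in src_phrase], axis=0)
    tgt = np.mean([emb_tgt[w] for w in tgt_phrase], axis=0)
    return np.concatenate([src, tgt])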
Phrase Pair Embedding
We use a recurrent neural network to generate two smoothed translation confidence scores based on source and target word embeddings.
Related Work
Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance.
Related Work
RNNLM (Mikolov et al., 2010) is first used to generate the source and target word embeddings, which are fed into a one-hidden-layer neural network to get a translation confidence score.
embeddings is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Srivastava, Shashank and Hovy, Eduard
Abstract
Hellinger PCA embeddings learnt using the framework show competitive results on empirical tasks.
Introduction
While word embeddings and language models from such methods have been useful for tasks such as relation classification, polarity detection, event coreference and parsing, much of the existing literature on composition is based on abstract linguistic theory and conjecture, and there is little evidence to support that learnt representations for larger linguistic units correspond to their semantic meanings.
Introduction
While this framework is attractive in the lack of assumptions on representation that it makes, the use of distributional embeddings for individual tokens means
Introduction
Recent work (Lebret and Lebret, 2013) has shown that the Hellinger distance is an especially effective measure in learning distributional embeddings, with Hellinger PCA being much more computationally inexpensive than neural language modeling approaches, while performing much better than standard PCA, and competitive with the state-of-the-art in downstream evaluations.
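A minimal sketch of the Hellinger PCA idea just mentioned (not the paper's code; names and the SVD-based projection are assumptions): row-normalise a word-by-context co-occurrence matrix into probabilities, take element-wise square roots so that Euclidean distance between rows is proportional to the Hellinger distance, and keep the leading principal directions.

import numpy as np

def hellinger_pca(cooc, dim=50):
    # cooc: word-by-context co-occurrence count matrix (rows assumed non-empty).
    P = cooc / cooc.sum(axis=1, keepdims=True)   # rows as probability distributions
    H = np.sqrt(P)                               # Hellinger transform
    H = H - H.mean(axis=0)                       # centre before PCA
    U, S, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :dim] * S[:dim]                  # low-dimensional word embeddings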
embeddings is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Ma, Ji and Zhang, Yue and Zhu, Jingbo
Experiments
This result illustrates that the ngram-level knowledge captures more complex interactions of the web text, which cannot be recovered by using only word embeddings.
Experiments
(2012), who found that using both the word embeddings and the hidden units of a trigram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only.
Related Work
(2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling.
Related Work
(2013) induce bilingual word embeddings for word alignment.
Related Work
(2013) investigate Chinese character embeddings for joint word segmentation and POS tagging.
embeddings is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Background
These generally consist of a projection layer that maps words, sub-word units or n-grams to high-dimensional embeddings; the latter are then combined component-wise with an operation such as summation.
Convolutional Neural Networks with Dynamic k-Max Pooling
Word embeddings have size d = 4.
Convolutional Neural Networks with Dynamic k-Max Pooling
The values in the embeddings w_i are parameters that are optimised during training.
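The section title above refers to dynamic k-max pooling; as a rough illustration of the underlying operation (not the paper's dynamic variant or its code), plain k-max pooling over a feature map can be sketched as:

import numpy as np

def k_max_pooling(feature_map, k):
    # For each feature row, keep the k largest values along the sentence axis,
    # preserving their original left-to-right order.
    idx = np.argsort(feature_map, axis=1)[:, -k:]   # positions of the k largest values
    idx.sort(axis=1)                                # restore original order
    return np.take_along_axis(feature_map, idx, axis=1)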
Experiments
The set of parameters comprises the word embeddings, the filter weights and the weights from the fully connected layers.
Experiments
As the dataset is rather small, we use lower-dimensional word vectors with d = 32 that are initialised with embeddings trained in an unsupervised way to predict contexts of occurrence (Turian et al., 2010).
Experiments
The randomly initialised word embeddings are increased in length to a dimension of d = 60.
embeddings is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ji, Yangfeng and Eisenstein, Jacob
Experiments
Another way to construct B is to use neural word embeddings (Collobert and Weston, 2008).
Experiments
In this case, we can view the product Bv as a composition of the word embeddings , using the simple additive composition model proposed by Mitchell
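The additive-composition reading of Bv can be made concrete with a short sketch (illustrative names; assumes v is a bag-of-words count vector and the columns of B are word embeddings).

import numpy as np

def additive_composition(B, word_ids, vocab_size):
    # With the columns of B set to word embeddings, B @ v with a bag-of-words
    # count vector v equals the sum of the embeddings of the words that occur.
    v = np.bincount(word_ids, minlength=vocab_size).astype(float)
    return B @ v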
Experiments
We used the word embeddings from Collobert and Weston (2008) with dimension {25, 50, 100}.
embeddings is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Silberer, Carina and Lapata, Mirella
Abstract
We introduce a new model which uses stacked autoencoders to learn higher-level embeddings from textual and visual input.
Experimental Setup
Finally, we also compare to the word embeddings obtained using Mikolov et al.’s (2011) recurrent neural network based language model.
Experimental Setup
These were pre-trained on Broadcast news data (400M words) using the word2vec tool. We report results with the 640-dimensional embeddings as they performed best.
Introduction
We evaluate the embeddings it produces on two tasks, namely word similarity and categorization.
Results
This indicates that higher level embeddings may be beneficial to NLP tasks in general, not only to those requiring multimodal information.
embeddings is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Iyyer, Mohit and Enns, Peter and Boyd-Graber, Jordan and Resnik, Philip
Experiments
LR-(W2V) is a logistic regression model trained on the average of the pretrained word embeddings for each sentence (Section 2.2).
Experiments
RNN2-(W2V) is initialized using word2vec embeddings and also includes annotated phrase labels in its training.
Recursive Neural Networks
The word2vec embeddings have linear relationships (e.g., the closest vectors to the average of
Where Compositionality Helps Detect Ideological Bias
Initializing the RNN W_e matrix with word2vec embeddings improves accuracy over random initialization by 1%.
embeddings is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Bamman, David and Dyer, Chris and Smith, Noah A.
Model
The first is the representation matrix W, which encodes the real-valued embeddings for each word in the vocabulary.
Model
Backpropagation using (input, output) word tuples learns the values of W (the embeddings) and X (the output parameter matrix) that maximize the likelihood of y (i.e., the context words) conditioned on the input word.
Model
Given an input word w and a set of active variable values A (e.g., A = {state = MA}), we calculate the hidden layer h as the sum of these independent embeddings: h = w^T W_main + Σ_{a∈A} w^T W_a.
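That sum of independent embeddings can be sketched as follows (a hedged illustration with assumed data structures: W_main as a vocabulary-by-dimension array and W_vars as a dict of per-variable arrays of the same shape).

import numpy as np

def hidden_layer(word_id, active_vars, W_main, W_vars):
    # The word's main embedding plus its embedding under each active variable
    # value (e.g., "state=MA"), summed component-wise.
    h = W_main[word_id].copy()
    for a in active_vars:
        h += W_vars[a][word_id]
    return h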
embeddings is mentioned in 4 sentences in this paper.
Topics mentioned in this paper: