Index of papers in Proc. ACL that mention
  • word embeddings
Huang, Eric and Socher, Richard and Manning, Christopher and Ng, Andrew
Abstract
We present a new neural network architecture which 1) learns word embeddings that better capture the semantics of words by incorporating both local and global document context, and 2) accounts for homonymy and polysemy by learning multiple embeddings per word.
Experiments
Table 2: Nearest neighbors of word embeddings learned by our model using the multi-prototype approach based on cosine similarity.
Experiments
Table 3: Spearman's ρ correlation on WordSim-353, showing our model's improvement over previous neural models for learning word embeddings.
Experiments
C&W* denotes the word embeddings trained and provided by C&W.
Global Context-Aware Neural Language Model
C_{s,d} = Σ_{w∈V} max(0, 1 − g(s, d) + g(s^w, d))   (1), where s^w denotes s with its last word replaced by w. Collobert and Weston (2008) showed that this ranking approach can produce good word embeddings that are useful in several NLP tasks, and allows much faster training of the model compared to optimizing log-likelihood of the next word.
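As a reading aid, here is a minimal Python sketch of this ranking (hinge) loss; the function and variable names are illustrative assumptions, not code from the paper.

    import numpy as np

    def ranking_loss(score_true, scores_corrupted):
        # score_true: g(s, d) for the observed word sequence s in document d.
        # scores_corrupted: g(s^w, d) for s with its last word replaced by each w in V.
        margins = 1.0 - score_true + np.asarray(scores_corrupted, dtype=float)
        return float(np.sum(np.maximum(0.0, margins)))

    # Toy usage: the true sequence should outscore each corruption by a margin of 1.
    print(ranking_loss(2.5, [0.3, 1.1, 2.9]))  # 1.4 -- only the 2.9 corruption violates the margin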
Global Context-Aware Neural Language Model
where [x_1, x_2, ..., x_m] is the concatenation of the m word embeddings representing sequence s, f is an element-wise activation function such as tanh, a_1 ∈ R^{h×1} is the activation of the hidden layer with h hidden nodes, W_1 ∈ R^{h×(mn)} and W_2 ∈ R^{1×h} are respectively the first and second layer weights of the neural network, and b_1, b_2 are the biases of each layer.
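A minimal sketch of this scoring network under the shapes above; the specific dimensions chosen here are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, h = 5, 50, 100                    # context length, embedding size, hidden nodes
    x = rng.standard_normal(m * n)          # [x_1; x_2; ...; x_m], concatenated embeddings
    W1 = rng.standard_normal((h, m * n)); b1 = np.zeros(h)
    W2 = rng.standard_normal((1, h));     b2 = np.zeros(1)

    a1 = np.tanh(W1 @ x + b1)               # hidden activation a_1 = f(W_1 x + b_1)
    score = W2 @ a1 + b2                    # scalar score of the local word sequence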
Global Context-Aware Neural Language Model
For the score of the global context, we represent the document also as an ordered list of word embeddings, d = (d_1, d_2, ..., d_k).
Related Work
Our model uses a neural network architecture similar to these models and uses the ranking-loss training objective proposed by Collobert and Weston (2008), but introduces a new way to combine local and global context to train word embeddings.
Related Work
Besides language modeling, word embeddings induced by neural language models have been useful in chunking, NER (Turian et al., 2010), parsing (Socher et al., 2011b), sentiment analysis (Socher et al., 2011c) and paraphrase detection (Socher et al., 2011a).
word embeddings is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Zhang, Jiajun and Liu, Shujie and Li, Mu and Zhou, Ming and Zong, Chengqing
Bilingually-constrained Recursive Auto-encoders
After learning word embeddings with a DNN (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013), each word in the vocabulary V corresponds to a vector x ∈ R^n, and all the vectors are stacked into an embedding matrix L ∈ R^{n×|V|}.
Bilingually-constrained Recursive Auto-encoders
Since word embeddings for the two languages are learned separately and lie in different vector spaces, we do not enforce the phrase embeddings in the two languages to be in the same semantic vector space.
Bilingually-constrained Recursive Auto-encoders
• L: word embedding matrix L for two languages (Section 3.1.1);
Introduction
The models using word embeddings as the direct inputs to DNN cannot make full use of the whole syntactic and semantic information of the phrasal translation rules.
Introduction
Kalchbrenner and Blunsom (2013) utilize a simple convolution model to generate phrase embeddings from word embeddings.
Related Work
One method considers the phrases as bag-of-words and employs a convolution model to transform the word embeddings to phrase embeddings (Collobert et al., 2011; Kalchbrenner and Blunsom, 2013).
Related Work
Instead, our bilingually-constrained recursive auto-encoders not only learn the composition mechanism of generating phrases from words, but also fine-tune the word embeddings during the model training stage, so that we can induce the full information of the phrases and internal words.
word embeddings is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Tang, Duyu and Wei, Furu and Yang, Nan and Zhou, Ming and Liu, Ting and Qin, Bing
Abstract
In this paper, we present a method that learns word embedding for Twitter sentiment classification.
Abstract
We address this issue by learning sentiment-specific word embedding (SSWE), which encodes sentiment information in the continuous representation of words.
Abstract
To obtain large scale training corpora, we learn the sentiment-specific word embedding from massive distant-supervised tweets collected by positive and negative emoticons.
Introduction
Accordingly, it is a crucial step to learn the word representation (or word embedding), which is a dense, low-dimensional and real-valued vector for a word.
Introduction
Although existing word embedding learning algorithms (Collobert et al., 2011; Mikolov et al., 2013) are intuitive choices, they are not effective enough if directly used for sentiment classification.
Introduction
In this paper, we propose learning sentiment-specific word embedding (SSWE) for sentiment analysis.
word embeddings is mentioned in 46 sentences in this paper.
Topics mentioned in this paper:
Tamura, Akihiro and Watanabe, Taro and Sumita, Eiichiro
Abstract
To overcome this limitation, we encourage agreement between the two directional models by introducing a penalty function that ensures word embedding consistency across two directional models during training.
Introduction
Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the embeddings of words.
RNN-based Alignment Model
In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as in the FFNN-based model.
Related Work
First, the lookup layer converts each input word into its word embedding by looking up its corresponding column in the embedding matrix (L), and then concatenates them.
Related Work
Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003).
Training
Concretely, this constraint enforces agreement between the word embeddings of the two directions.
Training
The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings:
Training
where θ_{FE} (or θ_{EF}) denotes the weights of the layers in the source-to-target (or target-to-source) alignment model, θ_L denotes the weights of the lookup layer, i.e., the word embeddings, and α is a parameter that controls the strength of the agreement constraint.
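A minimal sketch of how such a penalized joint objective can be computed; the function and argument names are assumptions for illustration, not the authors' code.

    import numpy as np

    def joint_objective(loss_fe, loss_ef, emb_fe, emb_ef, alpha=0.1):
        # loss_fe / loss_ef: losses of the source-to-target and target-to-source models;
        # emb_fe / emb_ef: their lookup-layer (word embedding) matrices;
        # alpha: strength of the agreement constraint.
        penalty = float(np.sum((emb_fe - emb_ef) ** 2))
        return loss_fe + loss_ef + alpha * penalty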
word embeddings is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Nguyen, Thien Huu and Grishman, Ralph
Abstract
This paper evaluates word embeddings and clustering on adapting feature-based relation extraction systems.
Abstract
We systematically explore various ways to apply word embeddings and show the best adaptation improvement by combining word cluster and word embedding information.
Introduction
valued features of words (such as word embeddings (Mnih and Hinton, 2007; Collobert and Weston, 2008)) effectively.
Introduction
ing word embeddings (Bengio et al., 2001; Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Turian et al., 2010) on feature-based methods to adapt RE systems to new domains.
Introduction
We explore the embedding-based features in a principled way and demonstrate that word embedding itself is also an effective representation for domain adaptation of RE.
Related Work
Although word embeddings have been successfully employed in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Maas and Ng, 2010), the application of word embeddings in RE is very recent.
Related Work
(2010) propose an abstraction-augmented string kernel for bio-relation extraction via word embeddings.
Related Work
(2012) and Khashabi (2013) use pre-trained word embeddings as input for Matrix-Vector Recursive Neural Networks (MV-RNN) to learn compositional structures for RE.
word embeddings is mentioned in 30 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Yang, Nan and Li, Mu and Zhou, Ming
Introduction
Word embedding is a dense, low dimensional, real-valued vector.
Introduction
Word embedding is usually learnt from a large monolingual corpus first, and then fine-tuned for specific tasks.
Introduction
In their work, bilingual word embedding is trained to capture lexical translation information, and surrounding words are utilized to model context information.
Our Model
Word embedding x_t is integrated with
Our Model
The new history h_t is used for the future prediction, and updated with new information from word embedding x_t recurrently.
Our Model
Word embedding x_t is integrated as new input information in recurrent neural networks for each prediction, but in recursive neural networks, no additional input information is used except the two representation vectors of the child nodes.
Related Work
In their work, the initial word embedding is first trained with a huge monolingual corpus; then the word embedding is adapted and fine-tuned bilingually in a context-dependent DNN-HMM framework.
Related Work
Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance.
Related Work
In their work, not only the target word embedding is used as the input of the network, but also the embedding of the source word, which is aligned to the current target word.
word embeddings is mentioned in 26 sentences in this paper.
Topics mentioned in this paper:
Fu, Ruiji and Guo, Jiang and Qin, Bing and Che, Wanxiang and Wang, Haifeng and Liu, Ting
Abstract
This paper proposes a novel and effective method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words.
Background
In this paper, we aim to identify hypernym-hyponym relations using word embeddings, which have been shown to preserve good properties for capturing the semantic relationship between words.
Introduction
This paper proposes a novel approach for semantic hierarchy construction based on word embeddings.
Introduction
Word embeddings, also known as distributed word representations, typically represent words with dense, low-dimensional and real-valued vectors.
Introduction
Word embeddings have been empirically shown to preserve linguistic regularities, such as the semantic relationship between words (Mikolov et al., 2013b).
Method
Then we elaborate on our proposed method composed of three major steps, namely, word embedding training, projection learning, and hypernym-hyponym relation identification.
Method
3.2 Word Embedding Training
Method
Various models for learning word embeddings have been proposed, including neural net language models (Bengio et al., 2003; Mnih and Hinton, 2008; Mikolov et al., 2013b) and spectral models (Dhillon et al., 2011).
word embeddings is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Andreas, Jacob and Klein, Dan
Abstract
Do continuous word embeddings encode any useful information for constituency parsing?
Abstract
We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon.
Abstract
Our results support an overall hypothesis that word embeddings import syntactic information that is ultimately redundant with distinctions learned from tree-banks in other ways.
Introduction
This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.
Introduction
While word embeddings can be constructed directly from surface distributional statistics, as in LSA, more sophisticated tools for unsupervised extraction of word representations have recently gained popularity (Collobert et al., 2011; Mikolov et al., 2013a).
Introduction
(Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features.
word embeddings is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Yang, Nan and Liu, Shujie and Li, Mu and Zhou, Ming and Yu, Nenghai
Abstract
We describe in detail how we adapt and extend the CD-DNN-HMM (Dahl et al., 2012) method introduced in speech recognition to the HMM-based word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences.
DNN structures for NLP
To apply a DNN to an NLP task, the first step is to transform a discrete word into its word embedding, a low-dimensional, dense, real-valued vector (Bengio et al., 2006).
DNN structures for NLP
Word embeddings often implicitly encode syntactic or semantic knowledge of the words.
DNN structures for NLP
Assuming a finite-sized vocabulary V, the word embeddings form an (L × |V|)-dimensional embedding matrix W_V, where L is a predetermined embedding length; mapping words to embeddings is done by simply looking up their respective columns in the embedding matrix W_V.
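A minimal sketch of this column lookup; the vocabulary and dimensions are toy assumptions.

    import numpy as np

    L_dim, vocab = 50, ["the", "cat", "sat"]
    W_V = np.random.default_rng(0).standard_normal((L_dim, len(vocab)))  # (L x |V|) matrix
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def embed(word):
        return W_V[:, word_to_id[word]]     # look up the word's column, shape (L,)

    print(embed("cat").shape)               # (50,)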
Introduction
Most works convert atomic lexical entries into a dense, low-dimensional, real-valued representation, called a word embedding; each dimension represents a latent aspect of a word, capturing its semantic and syntactic properties (Bengio et al., 2006).
Introduction
Word embedding is usually first learned from a huge amount of monolingual text, and then fine-tuned with task-specific objectives.
Introduction
As we mentioned in the last paragraph, word embedding (trained with huge monolingual texts) has the ability to map a word into a vector space, in which similar words are near each other.
Related Work
Most methods using DNNs in NLP start with a word embedding phase, which maps words into fixed-length, real-valued vectors.
Related Work
(Titov et al., 2012) learns context-free cross-lingual word embeddings to facilitate cross-lingual information retrieval.
Training
Tunable parameters in the neural network alignment model include: word embeddings in the lookup table LT, parameters W_l and b_l for the linear transformations in the hidden layers of the neural network, and the distortion parameters of jump distance.
word embeddings is mentioned in 23 sentences in this paper.
Topics mentioned in this paper:
liu, lemao and Watanabe, Taro and Sumita, Eiichiro and Zhao, Tiejun
Abstract
In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector.
Introduction
We also integrate word embedding into the model by representing each word as a feature vector (Collobert and Weston, 2008).
Introduction
For the local feature vector h’ in Eq (5), we employ word embedding features as described in the following subsection.
Introduction
3.3 Word Embedding features for AdNN
word embeddings is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua
Distributed representations
Distributed word representations are called word embeddings.
Distributed representations
Word embeddings are typically induced using neural language models, which use neural networks as the underlying predictive model (Bengio, 2008).
Introduction
word embeddings using unsupervised approaches.
Supervised evaluation tasks
The word embeddings also required a scaling hyperparameter, as described in Section 7.2.
Unlabeled Data
For rare words, which are typically updated only 143 times per epoch, and given that our embedding learning rate was typically 1e-6 or 1e-7, this means that rare word embeddings will be concentrated around zero, instead of spread out randomly.
Unlabeled Data
7.2 Scaling of Word Embeddings
Unlabeled Data
The word embeddings, however, are real numbers that are not necessarily in a bounded range.
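One plausible reading of such a scaling step, sketched under the assumption that the hyperparameter is a target standard deviation for the otherwise unbounded embedding values; the exact scheme in the paper may differ.

    import numpy as np

    def scale_embeddings(E, sigma=0.1):
        # Rescale the embedding matrix so its values have standard deviation sigma.
        return sigma * E / np.std(E)

    E = np.random.default_rng(0).standard_normal((50, 1000))   # toy embedding matrix
    print(round(float(np.std(scale_embeddings(E))), 3))        # ~0.1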
word embeddings is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Xu, Liheng and Liu, Kang and Lai, Siwei and Zhao, Jun
Experiments
The dimension of the word embedding is n = 100, the convergence threshold ε = 10^{-7}, and the number of expanded seeds T = 40.
Experiments
In contrast, CONT exploits latent semantics of each word in context, and LEX takes advantage of word embedding, which is induced from global word co-occurrence statistics.
The Proposed Method
To capture the lexical semantic clue, each word is first converted into a word embedding, which is a continuous vector in which each dimension's value corresponds to a semantic or grammatical interpretation (Turian et al., 2010).
The Proposed Method
Learning large-scale word embeddings is very time-consuming (Collobert et al., 2011), so we employ a faster method, the Skip-gram model (Mikolov et al., 2013).
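A minimal sketch of training Skip-gram embeddings with gensim (assuming gensim 4.x, where the dimensionality argument is vector_size); the toy corpus is an illustration only.

    from gensim.models import Word2Vec

    sentences = [["great", "battery", "life"], ["battery", "drains", "fast"]]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 -> Skip-gram
    vec = model.wv["battery"]   # 100-dimensional embedding for "battery"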
The Proposed Method
3.2.1 Learning Word Embedding for Semantic Representation
word embeddings is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Das, Dipanjan and Weston, Jason and Ganchev, Kuzman
Abstract
We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings.
Experiments
as described in §3.1 but conjoins them with the word identity rather than a word embedding.
Frame Identification with Embeddings
First, we extract the words in the syntactic context of runs; next, we concatenate their word embeddings as described in §2.2 to create an initial vector space representation.
Frame Identification with Embeddings
Formally, let x represent the actual sentence with a marked predicate, along with the associated syntactic parse tree; let our initial representation of the predicate context be g(x). Suppose that the word embeddings we start with are of dimension n. Then g is a function from a parsed sentence x to R^{nk}, where k is the number of possible syntactic context types.
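A minimal sketch of such a context representation: concatenate the n-dimensional embeddings of the words filling each of the k syntactic context slots, using zeros for empty slots (the slot handling and names are assumptions).

    import numpy as np

    def context_representation(context_words, embeddings, n):
        # context_words: one entry per syntactic context type (None if the slot is empty);
        # embeddings: dict mapping word -> n-dimensional vector; output lives in R^{n*k}.
        parts = [embeddings[w] if w in embeddings else np.zeros(n) for w in context_words]
        return np.concatenate(parts)

    emb = {"he": np.ones(3), "company": 2 * np.ones(3)}
    print(context_representation(["he", None, "company"], emb, 3))  # vector of length 9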
Frame Identification with Embeddings
So for example “He runs the company” could help the model disambiguate “He owns the company.” Moreover, since g(x) relies on word embeddings rather than word identities, information is shared between words.
Overview
We present a model that takes word embeddings as input and learns to identify semantic frames.
Overview
A word embedding is a distributed representation of meaning where each word is represented as a vector in R^n.
Overview
We use word embeddings to represent the syntactic context of a particular predicate instance as a vector.
word embeddings is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Blunsom, Phil
Conclusion
To summarize, we have presented a novel method for learning multilingual word embeddings using parallel data in conjunction with a multilingual objective function for compositional vector models.
Experiments
(2013), who published word embeddings across 100 languages, including all languages considered in this paper.
Experiments
While the classification experiments focused on establishing the semantic content of the sentence-level representations, we also want to briefly investigate the induced word embeddings.
Introduction
Such word embeddings are naturally richer representations than those of symbolic or discrete models, and have been shown to be able to capture both syntactic and semantic information.
Overview
We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings .
Related Work
In their simplest form, distributional information from large corpora can be used to learn embeddings, where the words appearing within a certain window of the target word are used to compute that word's embedding.
Related Work
(2011) further popularised using neural network architectures for learning word embeddings from large amounts of largely unlabelled data by showing the embeddings can then be used to improve standard supervised tasks.
word embeddings is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ji, Yangfeng and Eisenstein, Jacob
Experiments
Another way to construct B is to use neural word embeddings (Collobert and Weston, 2008).
Experiments
In this case, we can view the product Bv as a composition of the word embeddings, using the simple additive composition model proposed by Mitchell
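A minimal sketch of this view: if the columns of B are word embeddings and v is a bag-of-words count vector, then Bv is just the sum of the embeddings of the words in the span (toy vectors, illustrative only).

    import numpy as np

    emb = {"good": np.array([0.2, 0.5]), "movie": np.array([0.1, -0.3])}
    vocab = list(emb)                                  # ["good", "movie"]
    B = np.stack([emb[w] for w in vocab], axis=1)      # columns of B are word embeddings
    v = np.array([1, 1])                               # bag-of-words counts for the span
    print(B @ v)                                       # equals emb["good"] + emb["movie"]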
Experiments
We used the word embeddings from Collobert and Weston (2008) with dimension {25, 50, 100}.
word embeddings is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Ma, Ji and Zhang, Yue and Zhu, Jingbo
Experiments
This result illustrates that the ngram-level knowledge captures more complex interactions of the web text, which cannot be recovered by using only word embeddings.
Experiments
(2012), who found that using both the word embeddings and the hidden units of a trigram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only.
Related Work
(2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling.
Related Work
(2013) induce bilingual word embeddings for word alignment.
Related Work
In particular, we both use a nonlinear layer to model complex relations underlying word embeddings.
word embeddings is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Iyyer, Mohit and Enns, Peter and Boyd-Graber, Jordan and Resnik, Philip
Experiments
• LR-(W2V) is a logistic regression model trained on the average of the pretrained word embeddings for each sentence (Section 2.2).
Recursive Neural Networks
The word-level vectors x_a and x_b come from a d × V dimensional word embedding matrix W_e, where V is the size of the vocabulary.
Recursive Neural Networks
Random: The most straightforward choice is to initialize the word embedding matrix W_e and composition matrices W_L and W_R randomly, such that without any training, representations for words and phrases are arbitrarily projected into the vector space.
Recursive Neural Networks
word2vec: The other alternative is to initialize the word embedding matrix W_e with values that reflect the meanings of the associated word types.
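A minimal sketch contrasting the two initialization choices described above; the vocabulary, dimensions, and the stand-in for pretrained word2vec vectors are assumptions.

    import numpy as np

    d, vocab = 300, ["taxes", "freedom", "healthcare"]
    rng = np.random.default_rng(0)

    # Option 1: random initialization of the d x V embedding matrix W_e.
    W_e_random = rng.uniform(-0.05, 0.05, size=(d, len(vocab)))

    # Option 2: initialize W_e from pretrained vectors (random stand-ins here).
    pretrained = {w: rng.standard_normal(d) for w in vocab}
    W_e_w2v = np.stack([pretrained[w] for w in vocab], axis=1)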
word embeddings is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Parikh, Ankur P. and Cohen, Shay B. and Xing, Eric P.
Abstract
The word embeddings are used during the learning process, but the final decoder that the learning algorithm outputs maps a POS tag sequence x to a parse tree.
Abstract
Let V := {w_1, ..., w_ℓ, z_1, ..., z_H}, with w_i representing the word embeddings, and z_i representing the latent states of the bracketings.
Abstract
Word embeddings: As mentioned earlier, each w_i can be an arbitrary feature vector.
word embeddings is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Yanqing and Skiena, Steven
Related Work
proaches draft off of distributed word embeddings, which offer concise features reflecting the semantics of the underlying vocabulary.
Related Work
(2010) create powerful word embeddings by training on real and corrupted phrases, optimizing for the replaceability of words.
Related Work
(2012) demonstrates a powerful approach to English sentiment using word embeddings, which can easily be extended to other languages by training on appropriate text corpora.
word embeddings is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Convolutional Neural Networks with Dynamic k-Max Pooling
Word embeddings have size d = 4.
Experiments
The set of parameters comprises the word embeddings, the filter weights and the weights from the fully connected layers.
Experiments
The randomly initialised word embeddings are increased in length to a dimension of d = 60.
word embeddings is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hermann, Karl Moritz and Blunsom, Phil
Experiments
Instead of initialising the model with external word embeddings, we first train it on a large amount of data with the aim of overcoming the sparsity issues encountered in the previous experiment.
Experiments
In this phase only the reconstruction signal is used to learn word embeddings and transformation matrices.
Experiments
By learning word embeddings and composition matrices on more data, the model is likely to generalise better.
word embeddings is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: