Abstract | We propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings (compact vector representations for phrases), which can distinguish phrases with different semantic meanings.
Introduction | Models that use word embeddings as direct inputs to a DNN cannot make full use of the syntactic and semantic information of the phrasal translation rules.
Introduction | (2011) make the phrase embeddings capture the sentiment information. |
Introduction | (2013a) enable the phrase embeddings to mainly capture the syntactic knowledge. |
Abstract | This paper proposes a novel and effective method for the construction of semantic hierarchies based on word embeddings, which can be used to measure the semantic relationship between words.
Background | In this paper, we aim to identify hypernym-hyponym relations using word embeddings, which have been shown to preserve good properties for capturing semantic relationships between words.
Introduction | This paper proposes a novel approach for semantic hierarchy construction based on word embeddings.
Introduction | Word embeddings, also known as distributed word representations, typically represent words with dense, low-dimensional and real-valued vectors.
Introduction | Word embeddings have been empirically shown to preserve linguistic regularities, such as the semantic relationship between words (Mikolov et al., 2013b). |
Method | Various models for learning word embeddings have been proposed, including neural net language models (Bengio et al., 2003; Mnih and Hinton, 2008; Mikolov et al., 2013b) and spectral models (Dhillon et al., 2011). |
Method | (2013a) propose two log-linear models, namely the Skip-gram and CBOW models, to efficiently induce word embeddings.
Method | Therefore, we employ the Skip-gram model for estimating word embeddings in this study.
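Method | As a minimal illustration of estimating Skip-gram embeddings, the sketch below uses the gensim library; the toy corpus, vector size, and window are assumptions for illustration, not settings reported in the paper (gensim 4.x API).

    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens (placeholder data).
    corpus = [["semantic", "hierarchy", "construction"],
              ["word", "embeddings", "capture", "semantic", "relations"]]

    # sg=1 selects the Skip-gram architecture (sg=0 would give CBOW).
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

    vector = model.wv["semantic"]                       # learned embedding for a word
    neighbours = model.wv.most_similar("semantic", topn=3)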
Abstract | We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings.
Abstract | Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. |
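Abstract | A minimal numpy sketch of the kind of margin-based objective described above: embeddings of aligned sentences are pulled together while a randomly sampled non-parallel sentence is pushed at least a margin m away (the vectors and the margin value here are illustrative assumptions).

    import numpy as np

    def margin_loss(a, b, n, m=1.0):
        """Hinge loss: the aligned pair (a, b) should be closer than (a, n) by margin m."""
        d_pos = np.sum((a - b) ** 2)   # distance to the aligned translation
        d_neg = np.sum((a - n) ** 2)   # distance to a sampled noise sentence
        return max(0.0, m + d_pos - d_neg)

    a = np.random.randn(128)   # embedding of an English sentence
    b = np.random.randn(128)   # embedding of its German translation
    n = np.random.randn(128)   # embedding of an unrelated German sentence
    print(margin_loss(a, b, n))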
Experiments | We also investigate the learned embeddings from a qualitative perspective in §5.4. |
Experiments | All our embeddings have dimensionality d = 128, with the margin set to m = d. Further, we use L2 regularization with λ = 1 and step-size in {0.01, 0.05}.
Experiments | This task involves learning language independent embeddings which are then used for document classification across the English-German language pair. |
Introduction | Such word embeddings are naturally richer representations than those of symbolic or discrete models, and have been shown to be able to capture both syntactic and semantic information. |
Introduction | In this work, we extend this hypothesis to multilingual data and joint-space embeddings.
Overview | We describe a multilingual objective function that uses a noise-contrastive update between semantic representations of different languages to learn these word embeddings.
Abstract | By exploiting tag embeddings and tensor-based transformation, MMTNN has the ability to model complicated interactions between tags and context characters. |
Conventional Neural Network | The character embeddings are then stacked into an embedding matrix M ∈ R^(d×|C|).
Conventional Neural Network | We will analyze the effect of character embeddings in more detail in Section 4.
Conventional Neural Network | The character embeddings extracted by the Lookup Table layer are then concatenated into a single vector a ∈ R^(H1), where H1 = w · d is the size of Layer 1.
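Conventional Neural Network | As a rough numpy sketch of the lookup-and-concatenate step described above (the character vocabulary size, embedding dimension d, and window width w below are toy assumptions):

    import numpy as np

    d, vocab_size, w = 50, 5000, 5            # toy dimensions
    M = np.random.randn(d, vocab_size)        # embedding matrix: one column per character

    window_ids = [17, 42, 3, 99, 7]           # indices of the w characters in the window
    a = np.concatenate([M[:, i] for i in window_ids])   # Layer-1 input of size H1 = w * d
    assert a.shape == (w * d,)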
Introduction | between tags and context characters by exploiting tag embeddings and tensor-based transformation. |
Max-Margin Tensor Neural Network | Similar to character embeddings, given a fixed-sized tag set T, the tag embeddings for tags are stored in a tag embedding matrix L ∈ R^(d×|T|), where d is the dimensionality
Max-Margin Tensor Neural Network | of the vector space (same as character embeddings).
Max-Margin Tensor Neural Network | The tag embeddings start from a random initialization and can be automatically trained by back-propagation. |
Abstract | Do continuous word embeddings encode any useful information for constituency parsing? |
Abstract | We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon. |
Abstract | Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data. |
Introduction | This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space. |
Introduction | While word embeddings can be constructed directly from surface distributional statistics, as in LSA, more sophisticated tools for unsupervised extraction of word representations have recently gained popularity (Collobert et al., 2011; Mikolov et al., 2013a). |
Introduction | (Turian et al., 2010) have been shown to benefit from the inclusion of word embeddings as features. |
Abstract | This paper evaluates word embeddings and clustering on adapting feature-based relation extraction systems. |
Abstract | We systematically explore various ways to apply word embeddings and show the best adaptation improvement by combining word cluster and word embedding information. |
Introduction | valued features of words (such as word embeddings (Mnih and Hinton, 2007; Collobert and Weston, 2008)) effectively. |
Introduction | ing word embeddings (Bengio et al., 2001; Bengio et al., 2003; Mnih and Hinton, 2007; Collobert and Weston, 2008; Turian et al., 2010) on feature-based methods to adapt RE systems to new domains. |
Introduction | More importantly, we show empirically that word embeddings and word clusters capture different information and their combination would further improve the adaptability of relation extractors. |
Related Work | Although word embeddings have been successfully employed in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Maas and Ng, 2010), the application of word embeddings in RE is very recent. |
Related Work | (2010) propose an abstraction-augmented string kernel for bio-relation extraction via word embeddings.
Related Work | (2012) and Khashabi (2013) use pre-trained word embeddings as input for Matrix-Vector Recursive Neural Networks (MV-RNN) to learn compositional structures for RE.
Introduction | Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the word embeddings of the two directional models.
RNN-based Alignment Model | In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as in the FFNN-based model.
Related Work | Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003). |
Training | The constraint concretely enforces agreement in word embeddings of both directions. |
Training | The proposed method trains two directional models concurrently based on the following objective by incorporating a penalty term that expresses the difference between word embeddings: |
Training | where θ_FE (or θ_EF) denotes the weights of layers in a source-to-target (or target-to-source) alignment model, θ_L denotes the weights of a lookup layer, i.e., word embeddings, and α is a parameter that controls the strength of the agreement constraint.
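Training | Written out, the joint objective sketched above takes roughly the following form; this is a reconstruction for illustration using the notation above (with separate lookup weights θ_L^FE and θ_L^EF for the two directions), not the paper's exact formula.

    \min_{\theta_{FE},\,\theta_{EF},\,\theta_{L}^{FE},\,\theta_{L}^{EF}}\;
      \ell_{FE}\bigl(\theta_{FE},\theta_{L}^{FE}\bigr)
      + \ell_{EF}\bigl(\theta_{EF},\theta_{L}^{EF}\bigr)
      + \alpha \,\bigl\lVert \theta_{L}^{FE} - \theta_{L}^{EF} \bigr\rVert^{2}

Training | Here the two ℓ terms are the source-to-target and target-to-source alignment losses, and the last term penalizes disagreement between the two sets of word embeddings.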
Abstract | The word embeddings are used during the learning process, but the final decoder that the learning algorithm outputs maps a POS tag sequence x to a parse tree.
Abstract | Then, latent states are generated for each bracket, and finally, the latent states at the yield of the bracketing parse tree generate the words of the sentence (in the form of embeddings).
Abstract | Let V = {w_1, ..., w_N, z_1, ..., z_H}, with w_i representing the word embeddings, and z_i representing the latent states of the bracketings.
Abstract | We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings.
Experiments | The second baseline tries to decouple the WSABIE training from the embedding input, and trains a log-linear model using the embeddings.
Experiments | Hyperparameters For our frame identification model with embeddings, we search for the WSABIE hyperparameters using the development data.
Frame Identification with Embeddings | First, we extract the words in the syntactic context of runs; next, we concatenate their word embeddings as described in §2.2 to create an initial vector space representation. |
Frame Identification with Embeddings | Formally, let x represent the actual sentence with a marked predicate, along with the associated syntactic parse tree; let our initial representation of the predicate context be g(x). Suppose that the word embeddings we start with are of dimension n. Then g is a function from a parsed sentence x to R^(nk), where k is the number of possible syntactic context types.
Frame Identification with Embeddings | So for example “He runs the company” could help the model disambiguate “He owns the company.” Moreover, since g(x) relies on word embeddings rather than word identities, information is shared between words.
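Frame Identification with Embeddings | A small numpy sketch of building g(x) as described above: the embedding of the word filling each of k syntactic context slots is concatenated, with a zero block for empty or unknown slots (the slot names, words, and dimensions here are illustrative assumptions).

    import numpy as np

    n, k = 64, 3                              # embedding dimension and number of context slots
    emb = {"he": np.random.randn(n),          # toy pretrained word embeddings
           "company": np.random.randn(n)}

    def g(slot_words):
        """Concatenate one embedding block per syntactic slot (subject, object, ...)."""
        blocks = [emb.get(word, np.zeros(n)) for word in slot_words]
        return np.concatenate(blocks)         # a vector in R^(n*k)

    x_repr = g(["he", "company", None])       # e.g. subject, direct object, empty slot
    assert x_repr.shape == (n * k,)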
Overview | We present a model that takes word embeddings as input and learns to identify semantic frames. |
Overview | We use word embeddings to represent the syntactic context of a particular predicate instance as a vector. |
Related Work | The embeddings of C&W (Collobert et al., 2011), word2vec, WVSA (Maas et al., 2011) and our models are trained with the same dataset and same parameter setting.
Related Work | ReEmb(C&W) and ReEmb(w2v) stand for the use of embeddings learned from 10 million distant-supervised tweets with C&W and word2vec, respectively. |
Related Work | Table 3: Macro-F1 on positive/negative classification of tweets with different word embeddings.
Experiments and Results | As we mentioned in Section 5, constructing phrase pair embeddings from word embeddings may not be suitable.
Experiments and Results | We first train the source and target word embeddings separately using large monolingual data, following Collobert et al. (2011).
Experiments and Results | E_wm(s_i) and E_wm(t_j) are the monolingual word embeddings, and E_wb(s_i) and E_wb(t_j) are the bilingual word embeddings.
Model Training | Back propagation is performed along the tree structure, and the phrase pair embeddings of the leaf nodes are updated.
Phrase Pair Embedding | A simple approach to construct phrase pair embedding is to use the average of the embeddings of the words in the phrase pair. |
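Phrase Pair Embedding | A small numpy sketch of this averaging construction (the example words and the embedding dimension are toy assumptions):

    import numpy as np

    dim = 50
    emb_src = {w: np.random.randn(dim) for w in ["la", "maison", "bleue"]}
    emb_tgt = {w: np.random.randn(dim) for w in ["the", "blue", "house"]}

    def average(words, table):
        return np.mean([table[w] for w in words], axis=0)

    # Phrase pair embedding: concatenation of the averaged source and target sides.
    pair_vec = np.concatenate([average(["la", "maison", "bleue"], emb_src),
                               average(["the", "blue", "house"], emb_tgt)])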
Phrase Pair Embedding | We use a recurrent neural network to generate two smoothed translation confidence scores based on source and target word embeddings.
Related Work | Word embeddings capturing lexical translation information and surrounding words modeling context information are leveraged to improve the word alignment performance. |
Related Work | RNNLM (Mikolov et al., 2010) is first used to generate the source and target word embeddings, which are fed into a one-hidden-layer neural network to get a translation confidence score.
Abstract | Hellinger PCA embeddings learnt using the framework show competitive results on empirical tasks. |
Introduction | While word embeddings and language models from such methods have been useful for tasks such as relation classification, polarity detection, event coreference and parsing; much of existing literature on composition is based on abstract linguistic theory and conjecture, and there is little evidence to support that learnt representations for larger linguistic units correspond to their semantic meanings. |
Introduction | While this framework is attractive in the lack of assumptions on representation that it makes, the use of distributional embeddings for individual tokens means |
Introduction | Recent work (Lebret and Lebret, 2013) has shown that the Hellinger distance is an especially effective measure in learning distributional embeddings, with Hellinger PCA being much less computationally expensive than neural language modeling approaches, while performing much better than standard PCA, and competitive with the state-of-the-art in downstream evaluations.
Abstract | This is problematic when features lack clear linguistic meaning as in embeddings or when the information is blended across features. |
Introduction | First, features may lack clear linguistic interpretation as in distributional features or continuous vector embeddings of words. |
Introduction | Our low dimensional embeddings are tailored to the syntactic context of words (head, modifier).
Problem Formulation | By learning parameters U, V, and W that function well in dependency parsing, we also learn context-dependent embeddings for words and arcs. |
Related Work | Word-level vector space embeddings have so far had limited impact on parsing performance. |
Related Work | This framework enables us to learn new syntactically guided embeddings while also leveraging separately estimated word vectors as starting features, leading to improved parsing performance. |
Results | For this purpose, we train a model with only a tensor component (such that it has to learn an accurate tensor) on the English dataset and obtain low dimensional embeddings Uφ_w and Vφ_w for each word.
Results | The upper part shows that our learned embeddings group words with similar syntactic behavior.
Experiments | This result illustrates that the ngram-level knowledge captures more complex interactions of the web text, which cannot be recovered by using only word embeddings.
Experiments | (2012), who found that using both the word embeddings and the hidden units of a trigram WRRBM as additional features for a CRF chunker yields larger improvements than using word embeddings only. |
Related Work | (2010) learn word embeddings to improve the performance of in-domain POS tagging, named entity recognition, chunking and semantic role labelling. |
Related Work | (2013) induce bilingual word embeddings for word alignment. |
Related Work | (2013) investigate Chinese character embeddings for joint word segmentation and POS tagging. |
Background | These generally consist of a projection layer that maps words, sub-word units or n-grams to high dimensional embeddings; the latter are then combined component-wise with an operation such as summation.
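Background | A small numpy sketch of such a projection-plus-summation model: each token is mapped to its embedding and the embeddings are combined component-wise by summation (the vocabulary and dimension are toy assumptions).

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}
    P = np.random.randn(len(vocab), 100)      # projection layer: one embedding per word

    def encode(tokens):
        """Sum the embeddings of the tokens component-wise."""
        return np.sum([P[vocab[t]] for t in tokens], axis=0)

    sentence_vec = encode(["the", "cat", "sat"])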
Convolutional Neural Networks with Dynamic k-Max Pooling | Word embeddings have size d = 4. |
Convolutional Neural Networks with Dynamic k-Max Pooling | The values in the embeddings w_i are parameters that are optimised during training.
Experiments | The set of parameters comprises the word embeddings, the filter weights and the weights from the fully connected layers.
Experiments | As the dataset is rather small, we use lower-dimensional word vectors with d = 32 that are initialised with embeddings trained in an unsupervised way to predict contexts of occurrence (Turian et al., 2010). |
Experiments | The randomly initialised word embeddings are increased in length to a dimension of d = 60. |
Abstract | We introduce a new model which uses stacked autoencoders to learn higher-level embeddings from textual and visual input. |
Experimental Setup | Finally, we also compare to the word embeddings obtained using Mikolov et al.’s (2011) recurrent neural network based language model. |
Experimental Setup | These were pre-trained on Broadcast News data (400M words) using the word2vec tool. We report results with the 640-dimensional embeddings as they performed best.
Introduction | We evaluate the embeddings it produces on two tasks, namely word similarity and categorization. |
Results | This indicates that higher level embeddings may be beneficial to NLP tasks in general, not only to those requiring multimodal information. |
Experiments | Another way to construct B is to use neural word embeddings (Collobert and Weston, 2008). |
Experiments | In this case, we can view the product Bv as a composition of the word embeddings , using the simple additive composition model proposed by Mitchell |
Experiments | We used the word embeddings from Collobert and Weston (2008) with dimension {25, 50, 100}. |
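Experiments | A small numpy sketch of the additive composition referred to above: the columns of B hold word embeddings, v is a bag-of-words count vector for a phrase, and Bv is then simply the sum of the embeddings of the words in the phrase (the vocabulary and dimension are toy assumptions).

    import numpy as np

    dim, vocab = 25, ["not", "very", "good"]
    B = np.random.randn(dim, len(vocab))      # column i is the embedding of vocab[i]

    v = np.array([1, 0, 1])                   # counts for the phrase "not good"
    phrase_vec = B @ v                        # additive composition of word embeddings
    assert np.allclose(phrase_vec, B[:, 0] + B[:, 2])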
Experiments | LR-(W2V) is a logistic regression model trained on the average of the pretrained word embeddings for each sentence (Section 2.2).
Experiments | RNN2-(W2V) is initialized using word2vec embeddings and also includes annotated phrase labels in its training.
Recursive Neural Networks | The word2vec embeddings have linear relationships (e.g., the closest vectors to the average of |
Where Compositionality Helps Detect Ideological Bias | Initializing the RNN W_e matrix with word2vec embeddings improves accuracy over random initialization by 1%.
Model | The first is the representation matrix W, which encodes the real-valued embeddings for each word in the vocabulary.
Model | Backpropagation using (input x, output y) word tuples learns the values of W (the embeddings) and X (the output parameter matrix) that maximize the likelihood of y (i.e., the context words) conditioned on x (i.e., the input word).
Model | Given an input word w and set of active variable values A (e.g., A = {state = MA}), we calculate the hidden layer h as the sum of these independent embeddings: h = w^T W_main + Σ_{a∈A} w^T W_a.
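Model | A small numpy sketch of the hidden-layer computation above: the word's shared embedding is added to the embedding it receives under each active variable value (here a single assumed geographic variable).

    import numpy as np

    vocab_size, dim = 1000, 100
    W_main = np.random.randn(vocab_size, dim)               # shared word embeddings
    W_var = {"state=MA": np.random.randn(vocab_size, dim)}  # one matrix per variable value

    def hidden(word_id, active_values):
        """h = w^T W_main + sum over active variable values a of w^T W_a."""
        w = np.zeros(vocab_size)
        w[word_id] = 1.0                                     # one-hot input word
        h = w @ W_main
        for a in active_values:
            h = h + w @ W_var[a]
        return h

    h = hidden(42, active_values=["state=MA"])
    assert h.shape == (dim,)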