Abstract | If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. |
Abstract | We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. |
Abstract | We find further improvements by combining different word representations.
Distributional representations | Distributional word representations are based upon a co-occurrence matrix F of size W × C, where W is the vocabulary size, each row F_w is the initial representation of word w, and each column F_c is some context.
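Distributional representations | As a minimal sketch of this setup (the function name, window size, and toy corpus are illustrative, and contexts are simply the word types themselves, so C = W), the following builds such a matrix with a symmetric context window:

```python
import numpy as np

def build_cooccurrence(corpus, window=2):
    """Build a W x C co-occurrence matrix F: row F_w counts how often
    word w appears near each context word c within a +/-window span."""
    vocab = {w: i for i, w in enumerate(sorted({t for s in corpus for t in s}))}
    W = len(vocab)
    F = np.zeros((W, W))  # contexts here are the word types themselves (C = W)
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    F[vocab[w], vocab[sent[j]]] += 1
    return F, vocab

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
F, vocab = build_cooccurrence(corpus)
print(F[vocab["cat"]])  # the distributional representation of "cat"
```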
Distributional representations | Hyperspace Analogue to Language (HAL) is another early distributional approach (Lund et al., 1995; Lund & Burgess, 1996) to inducing word representations.
Introduction | A word representation is a mathematical object associated with each word, often a vector. |
Introduction | These limitations of one-hot word representations have prompted researchers to investigate unsupervised methods for inducing word representations over large unlabeled corpora. |
Introduction | One common approach to inducing unsupervised word representations is to use clustering, perhaps hierarchical.
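Introduction | Brown clustering is the classic hierarchical instance of this idea; as a hedged illustration only, the sketch below substitutes a flat k-means over the rows of the co-occurrence matrix F from the sketch above, so that each word's representation becomes a discrete cluster ID:

```python
from sklearn.cluster import KMeans

# Cluster the distributional rows of F from the previous sketch; each word's
# representation is then its cluster ID. Brown clustering is the classic
# hierarchical variant; flat k-means is used here only for brevity.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(F)
clusters = {w: int(labels[i]) for w, i in vocab.items()}
print(clusters)  # illustrative output, e.g. {"cat": 0, "dog": 0, "sat": 1, "the": 1}
```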
Experiments | In both tasks we compare our model’s word representations with several bag-of-words weighting methods and alternative approaches to word vector induction.
Experiments | 4.1 Word Representation Learning |
Experiments | We induce word representations with our model using 25,000 movie reviews from IMDB. |
Introduction | Word representations are a critical component of many natural language processing systems. |
Our Model | To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations.
Our Model | The energy function uses a word representation matrix R ∈ R^(β × |V|), where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = R_w corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, θ ∈ R^β, which weights each of the β dimensions of words’ representation vectors.
Our Model | We introduce a Frobenius norm regularization term for the word representation matrix R. The word biases b are not regularized, reflecting the fact that we want the biases to capture whatever overall word frequency statistics are present in the data.
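Our Model | A minimal sketch of this objective, assuming the β, R, b, and θ defined above (the function names, dimensions, and regularization weight are illustrative, and the training loop is omitted):

```python
import numpy as np

beta, V = 5, 100                 # embedding dimension and vocabulary size
rng = np.random.default_rng(0)
R = rng.normal(size=(beta, V))   # word representation matrix; phi_w = R[:, w]
b = np.zeros(V)                  # per-word biases (not regularized)

def word_log_probs(theta):
    """log p(w | theta) under the softmax induced by the energy
    E(w; theta) = -(theta . phi_w + b_w)."""
    scores = theta @ R + b       # theta weights the beta dimensions
    return scores - np.logaddexp.reduce(scores)

def regularized_nll(theta, doc_words, lam=1e-4):
    """Negative log-likelihood of a document plus the Frobenius-norm
    penalty on R only; the biases b stay free to absorb word frequency."""
    logp = word_log_probs(theta)
    return -logp[doc_words].sum() + lam * np.sum(R ** 2)

theta = rng.normal(size=beta)
print(regularized_nll(theta, doc_words=np.array([3, 17, 17, 42])))
```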
Experiments | We have the same observation as Plank and Moschitti (2013) that when the gold-standard labels are used, the impact of word representations is limited, since the gold-standard information seems to dominate.
Experiments | However, whenever the gold labels are unavailable or inaccurate, the word representations would be useful for improving cross-domain adaptability.
Experiments | This section examines the effectiveness of word representations for RE across domains. |
Introduction | The application of word representations such as word clusters in domain adaptation of RE (Plank and Moschitti, 2013) is motivated by their success in semi-supervised methods (Chan and Roth, 2010; Sun et al., 2011), where word representations help to reduce the data sparseness of lexical information in the training data.
Introduction | In DA terms, since the vocabularies of the source and target domains are usually different, word representations mitigate lexical sparsity by providing general features of words that are shared across domains, hence bridging the gap between domains.
Regularization | Given the more general representations provided by word representations above, how can we learn a relation extractor from the labeled source domain data that generalizes well to new domains? |
Regularization | In fact, this setting can benefit considerably from our general approach of applying word representations and regularization. |
Word Representations | We consider two types of word representations and use them as additional features in our DA system, namely Brown word clustering (Brown et al., 1992) and word embeddings (Bengio et al., 2001). |
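Word Representations | The sketch below illustrates this kind of feature augmentation (the helper name, prefix lengths, and lookup tables are assumptions, not the system's actual feature templates): Brown clusters contribute discrete bit-string prefix features, while embeddings contribute continuous per-dimension features.

```python
def token_features(word, brown_paths, embeddings):
    """Augment a token's lexical features with two kinds of word
    representations: Brown cluster bit-string prefixes (discrete) and
    embedding dimensions (continuous). Prefix lengths 4/6/10 follow
    common practice; both lookup tables are assumed inputs."""
    feats = {"word=" + word.lower(): 1.0}
    path = brown_paths.get(word.lower())
    if path is not None:
        for p in (4, 6, 10):
            feats["brown%d=%s" % (p, path[:p])] = 1.0
    vec = embeddings.get(word.lower())
    if vec is not None:
        for i, v in enumerate(vec):
            feats["emb%d" % i] = float(v)
    return feats

print(token_features("Acquisition",
                     brown_paths={"acquisition": "110100111010"},
                     embeddings={"acquisition": [0.12, -0.40, 0.05]}))
```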
Abstract | Unsupervised word representations are very useful in NLP tasks both as inputs to learning algorithms and as extra word features in NLP systems. |
Conclusion | We presented a new neural network architecture that learns more semantic word representations by using both local and global context in learning. |
Experiments | In order to show that our model learns more semantic word representations with global context, we give the nearest neighbors of our single-prototype model versus C&W’s, which only uses local context. |
Global Context-Aware Neural Language Model | Our model jointly learns word representations while learning to discriminate the next word given a short word sequence (local context) and the document (global context) in which the word sequence occurs. |
Global Context-Aware Neural Language Model | Because our goal is to learn useful word representations and not the probability of the next word given previous words (which prohibits looking ahead), our model can utilize the entire document to provide global context.
Global Context-Aware Neural Language Model | The embedding matrix L contains the word representations.
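Global Context-Aware Neural Language Model | A rough sketch of this two-part scoring (all dimensions and weight matrices are illustrative, and a plain average of document word vectors stands in for the paper's weighted average):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, win = 50, 1000, 5
L = rng.normal(scale=0.1, size=(d, V))  # embedding matrix: the word representations

# Hypothetical single-hidden-layer scorers for the local and global paths.
W1 = rng.normal(scale=0.1, size=(64, win * d)); u1 = rng.normal(size=64)
W2 = rng.normal(scale=0.1, size=(64, 2 * d));   u2 = rng.normal(size=64)

def score(window_ids, doc_ids):
    """Sum of a local score (embeddings of a short word window) and a
    global score (the window's last word paired with an average of all
    document word vectors, standing in for the weighted average)."""
    local = np.tanh(W1 @ L[:, window_ids].T.reshape(-1))
    doc_vec = L[:, doc_ids].mean(axis=1)
    glob = np.tanh(W2 @ np.concatenate([doc_vec, L[:, window_ids[-1]]]))
    return u1 @ local + u2 @ glob

print(score(window_ids=[3, 14, 159, 26, 535], doc_ids=list(range(100))))
```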
Introduction | The model learns word representations that better capture the semantics of words, while still keeping syntactic information. |
Multi-Prototype Neural Language Model | Finally, each word occurrence in the corpus is relabeled to its associated cluster and is used to train the word representation for that cluster. |
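Multi-Prototype Neural Language Model | A minimal sketch of this relabeling step, assuming each occurrence is represented by an average of its context word vectors (the function name, cluster count, and random data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def relabel_occurrences(occurrence_contexts, n_prototypes=3):
    """Multi-prototype sketch: represent each occurrence of a word by an
    average of its context word vectors, cluster those vectors, and
    relabel each occurrence with its cluster ID so that a separate
    representation can be trained per cluster (e.g. bank_0, bank_1, ...)."""
    X = np.stack([ctx.mean(axis=0) for ctx in occurrence_contexts])
    return KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
# 12 occurrences of one word, each with 4 context vectors of dimension 10
contexts = [rng.normal(size=(4, 10)) for _ in range(12)]
print(relabel_occurrences(contexts))  # cluster label for each occurrence
```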
Related Work | Two other recent papers (Dhillon et al., 2011; Reddy et al., 2011) present models for constructing word representations that deal with context. |
Experiments | As mentioned in Section 3.1, the knowledge learned by the WRRBM can be investigated incrementally: using the word-level representation, which corresponds to initializing only the projection layer of the web-feature module with the projection matrix of the learned WRRBM, or the ngram-level representation, which corresponds to initializing both the projection and sigmoid layers of the web-feature module with the learned WRRBM.
Experiments | “word” and “ngram” denote using word representations and n-gram representations, respectively.
Experiments | From Figures 2, 3 and 4, we can see that adopting the ngram-level representation consistently achieves better performance than using word representations only (“word-fixed” vs. “ngram-fixed”, “word-adjusted” vs. “ngram-adjusted”).
Learning from Web Text | We utilize the Word Representation RBM (WRRBM) factorization proposed by Dahl et al. |
Learning from Web Text | The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders. |
Learning from Web Text | The position-dependent weight matrices {W^(1), …, W^(n)} can be trained using a Metropolis-Hastings-based CD variant, and the learned word representations also capture certain syntactic information; see Dahl et al.
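Learning from Web Text | As an illustration of this sharing scheme (dimensions and names are assumptions, and training via the CD variant is omitted), the hidden activation below uses one embedding table R for all positions but a separate weight matrix per position:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, n, H = 30, 5000, 3, 100
R = rng.normal(scale=0.1, size=(d, V))                      # shared word representations
W = [rng.normal(scale=0.1, size=(H, d)) for _ in range(n)]  # one weight matrix per position
c = np.zeros(H)                                             # hidden biases

def hidden_probabilities(ngram_ids):
    """Hidden-unit activation of a WRRBM-style model: every position uses
    the same embedding table R but its own weights W[i], so the model can
    tell 'dog bites man' from 'man bites dog'."""
    act = c + sum(W[i] @ R[:, w] for i, w in enumerate(ngram_ids))
    return 1.0 / (1.0 + np.exp(-act))                       # sigmoid

print(hidden_probabilities([12, 407, 33])[:5])
```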
Neural Network for POS Disambiguation | We can choose to use only the word representations of the learned WRRBM. |
Experiments | (2013) in that it uses additional features (vector space word representations) and a different classification method (we use random forests while Tsvetkov et al.
Methodology | We define three main feature categories: (1) abstractness and imageability, (2) supersenses, and (3) unsupervised vector-space word representations; each category corresponds to a group of features with a common theme and representation.
Methodology | • Vector space word representations.
Methodology | Vector space word representations learned using unsupervised algorithms are often effective features in supervised learning methods (Turian et al., 2010). |
Model and Feature Extraction | Vector space word representations.
Model and Feature Extraction | We employ 64-dimensional vector-space word representations constructed by Faruqui and Dyer (2014). The vector construction algorithm is a variation of traditional latent semantic analysis (Deerwester et al., 1990) that uses multilingual information to produce representations in which synonymous words have similar vectors.
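Model and Feature Extraction | The sketch below illustrates the intended property with toy stand-in vectors (the actual Faruqui and Dyer vectors are learned from multilingual data, not constructed like this): synonymous words should have a high cosine similarity, unrelated words a low one.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy stand-ins for 64-dimensional vectors; in a multilingually-informed
# space, synonyms should score higher than unrelated pairs.
rng = np.random.default_rng(0)
vectors = {"buy": rng.normal(size=64)}
vectors["purchase"] = vectors["buy"] + 0.1 * rng.normal(size=64)  # near-synonym
vectors["turtle"] = rng.normal(size=64)                           # unrelated

print(cosine(vectors["buy"], vectors["purchase"]))  # high
print(cosine(vectors["buy"], vectors["turtle"]))    # near zero
```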
Introduction | Within a monolingual context, the distributional hypothesis (Firth, 1957) forms the basis of most approaches for learning word representations.
Introduction | Unlike most methods for learning word representations , which are restricted to a single language, our approach learns to represent meaning across languages in a shared multilingual semantic space. |
Related Work | Neural language models are another popular approach for inducing distributed word representations (Bengio et al., 2003). |
Related Work | Unsupervised word representations can easily be plugged into a variety of NLP-related tasks.
Related Work | Hermann and Blunsom (2014) propose a large-margin learner for multilingual word representations, similar to the basic additive model proposed here, which, like the approaches above, relies on a bag-of-words model for sentence representations.
Abstract | Most existing algorithms for learning continuous word representations typically only model the syntactic context of words but ignore the sentiment of text. |
Introduction | Accordingly, a crucial step is to learn the word representation (or word embedding): a dense, low-dimensional, real-valued vector for each word.
Related Work | However, the one-hot word representation cannot sufficiently capture the complex linguistic characteristics of words. |
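Related Work | A short illustration of why: distinct one-hot vectors are always orthogonal, so every pair of different words looks equally dissimilar regardless of meaning.

```python
import numpy as np

# One-hot vectors make every pair of distinct words equally (dis)similar:
# their dot product, and hence cosine similarity, is always zero.
vocab = {"good": 0, "great": 1, "terrible": 2}
def one_hot(word, V=len(vocab)):
    v = np.zeros(V); v[vocab[word]] = 1.0
    return v

print(one_hot("good") @ one_hot("great"))     # 0.0, despite similar meaning
print(one_hot("good") @ one_hot("terrible"))  # 0.0, same score as above
```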
Related Work | The results of bag-of-ngram (uni/bi/tri-gram) features are not satisfactory because the one-hot word representation cannot capture the latent connections between words.
Related Work | In this paper, we propose learning continuous word representations as features for Twitter sentiment classification under a supervised learning framework. |
Abstract | Given labeled data annotated with frame-semantic parses, we learn a model that projects the set of word representations for the syntactic context around a predicate to a low dimensional representation. |
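Abstract | A minimal sketch of such a projection, assuming fixed context slots and a linear map (the dimensions and names are illustrative, and the paper's model and training procedure are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, m = 50, 10, 4                          # word dim, low dimension, context slots
M = rng.normal(scale=0.1, size=(k, m * d))   # projection matrix (learned in practice)

def project_context(context_vectors):
    """Concatenate the word representations of the syntactic context
    around the predicate (missing slots as zero vectors) and map the
    result into the k-dimensional space where frames are scored."""
    slots = [v if v is not None else np.zeros(d) for v in context_vectors]
    return M @ np.concatenate(slots)

ctx = [rng.normal(size=d), None, rng.normal(size=d), rng.normal(size=d)]
z = project_context(ctx)   # compare z against frame embeddings, e.g. by dot product
print(z.shape)             # (10,)
```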
Introduction | We present a new technique for semantic frame identification that leverages distributed word representations.
Introduction | Distributed Word Representations