Abstract | We demonstrate that distributional representations of word types, trained on unannotated text, can be used to improve performance on rare words.
Introduction | We investigate the use of distributional representations, which model the probability distribution of a word’s context, as techniques for finding smoothed representations of word sequences.
Introduction | That is, we use the distributional representations to share information across unannotated examples of the same word type. |
Introduction | We then compute features of the distributional representations, and provide them as input to our supervised sequence labelers.
Smoothing Natural Language Sequences | Importantly, we seek distributional representations that will provide features that are common in both training and test data, to avoid data sparsity. |
Smoothing Natural Language Sequences | In the next three sections, we develop three techniques for smoothing text using distributional representations.
Smoothing Natural Language Sequences | This gives greater weight to words with more idiosyncratic distributions and may improve the informativeness of a distributional representation.
Abstract | We propose a novel approach that integrates the distributional representation of multiple subsets of the MWP’s words. |
Background and Related Work | While previous work focused either on improving the quality of the distributional representations themselves or on their incorporation into more elaborate systems, we focus on the integration of the distributional representation of multiple LCs to improve the identification of inference relations between MWPs. |
Background and Related Work | Much work in recent years has concentrated on the relation between the distributional representations of composite phrases and the representations of their component subparts (Widdows, 2008; Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Coecke et al., 2010).
Background and Related Work | Despite significant advances, previous work has mostly been concerned with highly compositional cases and does not address the distributional representation of predicates of varying degrees of compositionality. |
Conclusion | We have presented a novel approach to the distributional representation of multi-word predicates. |
Discussion | Much recent work subsumed under the title Compositional Distributional Semantics addressed the distributional representation of multi-word phrases (see Section 2). |
Discussion | A standard approach in CDS is to compose distributional representations by taking their vector sums v_L = v_l1 + v_l2 + ... + v_ln and v_R = v_r1 + ... + v_rn (Mitchell and Lapata, 2010).
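A minimal sketch of this additive composition, with invented 4-dimensional word vectors for the two sides of a candidate MWP pair (`compose_sum` and `cosine` are hypothetical helper names, not from the cited work):

```python
import numpy as np

# Invented distributional vectors for the words of a two-word phrase on
# each side of a candidate inference pair.
v_l = [np.array([1.0, 0.0, 2.0, 1.0]), np.array([0.0, 1.0, 1.0, 0.0])]
v_r = [np.array([1.0, 1.0, 1.0, 1.0]), np.array([0.5, 0.0, 2.0, 0.0])]

def compose_sum(vectors):
    """Additive composition: the phrase vector is the sum of its word vectors."""
    return np.sum(vectors, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vL = compose_sum(v_l)  # v_L = v_l1 + v_l2
vR = compose_sum(v_r)  # v_R = v_r1 + v_r2
similarity = cosine(vL, vR)
```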
Introduction | This heterogeneity of “take” is likely to have a negative effect on downstream systems that use its distributional representation.
Introduction | For instance, while “take” and “accept” are often considered lexically similar, the high frequency in which “take” participates in non-compositional MWPs is likely to push the two verbs’ distributional representations apart. |
Introduction | This approach allows the classifier that uses the distributional representations to take into account the most relevant LCs in order to make the prediction. |
Our Proposal: A Latent LC Approach | We propose a method for addressing MWPs of varying degrees of compositionality through the integration of the distributional representation of multiple subsets of the predicate’s words (LCs). |
Experiments | (2012), learning distributed representations on the Europarl corpus and evaluating on documents from the Reuters RCV1/RCV2 corpora.
Experiments | We use the training data of the corpus to learn distributed representations across 12 languages. |
Experiments | In a third evaluation (Table 4), we apply the embeddings learnt with our models to a monolingual classification task, enabling us to compare with prior work on distributed representation learning.
Introduction | Distributed representations of words provide the basis for many state-of-the-art approaches to various problems in natural language processing today. |
Overview | Distributed representation learning describes the task of learning continuous representations for discrete objects. |
Overview | Such distributed representations allow a model to share meaning between similar words, and have been used to capture semantic, syntactic and morphological content (Collobert and Weston, 2008; Turian et al., 2010, inter alia). |
Overview | Some work has exploited this idea for transferring linguistic knowledge into low-resource languages or to learn distributed representations at the word level (Klementiev et al., 2012; Zou et al., 2013; Lauly et al., 2013, inter alia). |
Related Work | Distributed Representations: Distributed representations can be learned through a number of approaches.
Related Work | Tasks where the use of distributed representations has resulted in improvements include topic modelling (Blei et al., 2003) and named entity recognition (Turian et al., 2010; Collobert et al., 2011).
Related Work | Multilingual Representation Learning: Most research on distributed representation induction has focused on single languages.
Distributed representations | Another approach to word representation is to learn a distributed representation.
Distributed representations | (Not to be confused with distributional representations.)
Distributed representations | A distributed representation is dense, low-dimensional, and real-valued. |
Distributional representations | LSA (Dumais et al., 1988; Landauer et al., 1998), LSI, and LDA (Blei et al., 2003) induce distributional representations over F in which each column is a document context. |
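A toy version of this construction over a small invented term-document count matrix F, with NumPy's SVD standing in for the LSA machinery (all counts and the choice of k are illustrative):

```python
import numpy as np

# Invented term-document count matrix F: one row per word type, one
# column per document context.
F = np.array([
    [2, 0, 1, 0],   # "take"
    [1, 0, 2, 0],   # "accept"
    [0, 3, 0, 2],   # "vector"
    [0, 2, 0, 3],   # "matrix"
], dtype=float)

# LSA-style smoothing: a truncated SVD keeps the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
k = 2
word_reprs = U[:, :k] * s[:k]   # low-dimensional distributional representations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words with similar document distributions end up close in the latent space.
sim_take_accept = cosine(word_reprs[0], word_reprs[1])
sim_take_vector = cosine(word_reprs[0], word_reprs[2])
```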
Distributional representations | However, like all the works cited above, Sahlgren (2006) only uses distributional representations to improve existing systems for one-shot classification tasks, such as IR, WSD, semantic knowledge tests, and text categorization.
Distributional representations | Previous research has achieved repeated successes on these tasks using clustering representations (Section 3) and distributed representations (Section 4), so we focus on these representations in our work. |
Supervised evaluation tasks | We apply clustering and distributed representations to NER and chunking, which allows us to compare our semi-supervised models to those of Ando and Zhang (2005) and Suzuki and Isozaki (2008). |
Background | Distributional representations encode an expression by its environment, assuming the context-dependent nature of meaning according to which one “shall know a word by the company it keeps” (Firth, 1957). |
Background | Distributional representations are frequently used to encode single words as vectors. |
Background | While it is theoretically possible to apply the same mechanism to larger expressions, sparsity prevents learning meaningful distributional representations for expressions much larger than single words.
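A bare-bones illustration of such a context-count representation for a single word over a toy corpus (the corpus, window size, and helper name are invented for the example):

```python
from collections import Counter

# A toy corpus and window size, invented for illustration.
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 1

def context_vector(word, tokens, win):
    """Distributional representation: counts of words within +/- win tokens."""
    ctx = Counter()
    for i, t in enumerate(tokens):
        if t == word:
            lo, hi = max(0, i - win), min(len(tokens), i + win + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return ctx

sat_ctx = context_vector("sat", corpus, window)
```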
Introduction | While distributional semantics is easily applied to single words, sparsity implies that attempts to directly extract distributional representations for larger expressions are doomed to fail. |
Constraints on Inter-Domain Variability | As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations.
Discussion and Conclusions | In this paper we presented a domain-adaptation method based on semi-supervised learning with distributed representations coupled with constraints favoring domain-independence of modeled phenomena. |
Introduction | Such LVMs can be regarded as composed of two parts: a mapping from the initial (normally word-based) representation to a new shared distributed representation, and a classifier in this representation.
Related Work | Semi-supervised learning with distributed representations and its application to domain adaptation has previously been considered in (Huang and Yates, 2009), but no attempt has been made to address problems specific to the domain-adaptation setting.
The Latent Variable Model | The adaptation method advocated in this paper is applicable to any joint probabilistic model which uses distributed representations, i.e.
Conventional Neural Network | The idea of distributed representations for symbolic data is one of the most important reasons why neural networks work.
Experiment | Wang and Manning (2013) conduct an empirical study on the effect of nonlinearity, and the results suggest that nonlinear models are highly effective only when distributed representations are used.
Experiment | To explain why distributed representations capture more information than discrete features, we show in Table 4 the effect of character embeddings, which are obtained from the lookup table of the MMTNN after training.
Experiment | Therefore, compared with discrete feature representations, distributed representations can capture the syntactic and semantic similarity between characters.
Max-Margin Tensor Neural Network | To better model the tag-tag interaction given the context characters, our model uses distributed representations for tags instead of the traditional discrete symbolic representation.
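A rough sketch of the idea: each tag gets a learned embedding, and tag-tag interactions are scored over those embeddings. The tag inventory, dimensionality, and interaction matrix below are all invented (the actual model learns these jointly and uses a tensor; an identity matrix stands in here):

```python
import numpy as np

rng = np.random.default_rng(1)
tags = {"B": 0, "M": 1, "E": 2, "S": 3}   # invented tag inventory
d = 5                                     # invented tag-embedding size

# Distributed tag representations instead of discrete tag symbols.
T = rng.standard_normal((len(tags), d))

def transition_score(prev_tag, tag, W=np.eye(d)):
    """A toy bilinear tag-tag interaction score over the tag embeddings."""
    return float(T[tags[prev_tag]] @ W @ T[tags[tag]])

score = transition_score("B", "M")
```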
Introduction | Notable among the most effective distributional representations are the recent deep-learning approaches by Socher et al. |
Introduction | With such a working definition, contiguous motifs are likely to make distributional representations less noisy and also assist in disambiguating context. |
Introduction | Also, the lack of specificity ensures that such motifs are common enough to meaningfully influence distributional representation beyond single tokens. |
Composition methods | (2010) take inspiration from formal semantics to characterize composition in terms of function application, where the distributional representation of one element in a composition (the functor) is not a vector but a function. |
Introduction | It is natural to hypothesize that the same methods can be applied to morphology to derive the meaning of complex words from the meaning of their parts: For example, instead of harvesting a rebuild vector directly from the corpus, the latter could be constructed from the distributional representations of re- and build. |
Introduction | We adapt a number of composition methods from the literature to the morphological setting, and we show that some of these methods can provide better distributional representations of derived forms than either those directly harvested from a large corpus, or those obtained by using the stem as a proxy to derived-form meaning. |
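One of the simplest such composition methods, weighted addition, can be sketched as follows. All vectors and the (stem, derived) training pairs are invented, and plain least squares stands in for whatever estimation the adapted methods actually use:

```python
import numpy as np

# Invented 3-d vectors: corpus-harvested stems, the affix "re-", and the
# "observed" derived forms for a few training pairs.
stems = np.array([[1.0, 0.2, 0.0],    # build
                  [0.8, 0.1, 0.3],    # paint
                  [0.2, 0.9, 0.1]])   # think
affix = np.array([0.1, 0.0, 1.0])     # re-
derived = 0.7 * stems + 0.4 * affix   # rebuild, repaint, rethink

# Weighted-additive composition: derived ~ a*stem + b*affix, with the two
# scalars fit by least squares over the training pairs.
A = np.stack([stems.ravel(), np.tile(affix, len(stems))], axis=1)
(a, b), *_ = np.linalg.lstsq(A, derived.ravel(), rcond=None)

# Predicted representation of an unseen derived form.
predicted_rebuild = a * stems[0] + b * affix
```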
Related work | Our goal is to automatically construct, given distributional representations of stems and affixes, semantic representations for the derived words containing those stems and affixes. |
Introduction | Distributional representations of words have been successfully used in many language processing tasks such as entity set expansion (Pantel et al., 2009), part-of-speech (POS) tagging and chunking (Huang and Yates, 2009), ontology learning (Curran, 2005), computing semantic textual similarity (Besancon et al., 1999), and lexical inference (Kotlerman et al., 2012). |
Introduction | Consequently, the distributional representations of the word lightweight will differ considerably between the two domains. |
Introduction | The SVD smoothing in the first step both reduces the data sparseness in distributional representations of individual words, as well as the dimensionality of the feature space, thereby enabling us to efficiently and accurately learn a prediction model using PLSR in the second step. |
Overview | We first create a distributional representation for a word using the data from a single domain, and then learn a Partial Least Squares Regression (PLSR) model to predict the distribution of a word in a target domain given its distribution in a source domain.
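A toy version of the two-step procedure, with invented data and ordinary least squares standing in for PLSR (the dimensions, seed, and noise level are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented distributional vectors for the same 20 words, observed in a
# source domain (X_src) and a target domain (X_tgt).
X_src = rng.random((20, 10))
X_tgt = X_src @ rng.random((10, 10)) + 0.01 * rng.standard_normal((20, 10))

# Step 1: SVD smoothing of the source representations (keep k latent dims),
# reducing both sparseness and dimensionality.
U, s, Vt = np.linalg.svd(X_src, full_matrices=False)
k = 5
X_smooth = U[:, :k] * s[:k]

# Step 2: learn a linear map from the smoothed source space to the target
# domain (ordinary least squares as a stand-in for PLSR).
W, *_ = np.linalg.lstsq(X_smooth, X_tgt, rcond=None)
X_pred = X_smooth @ W
```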
Improving a distributional thesaurus | The principles presented in the previous section face one major problem compared to the “classical” distributional approach, in which the semantic similarity of two words can be evaluated directly by computing the similarity of their distributional representations.
Principles | As a result, the distributional representation of a word takes the unstructured form of a bag of words or the more structured form of a set of pairs {syntactic relation, word}. |
Principles | A variant of this approach was proposed in (Kazama et al., 2010) where the distributional representation of a word is modeled as a multinomial distribution with Dirichlet as prior. |
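The Dirichlet-smoothed variant can be illustrated with the posterior-mean estimate of the multinomial under a symmetric Dirichlet prior (the context counts and prior strength below are invented):

```python
from collections import Counter

# Invented context counts for one word and a symmetric Dirichlet prior.
counts = Counter({"eat": 3, "drink": 1, "run": 0})
alpha = 0.5
vocab = list(counts)

total = sum(counts.values())
# Posterior mean of the multinomial under a Dirichlet(alpha) prior:
# every context, even an unseen one, receives non-zero probability mass.
smoothed = {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}
```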
Abstract | We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings. |
Introduction | Distributed representations of words have proved useful for a number of tasks. |
Overview | A word embedding is a distributed representation of meaning where each word is represented as a vector in R^n.
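Concretely, such an embedding is just a row lookup into a real-valued matrix (the vocabulary, dimensionality, and values here are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"frame": 0, "predicate": 1, "context": 2}   # invented vocabulary
n = 4                                                # invented dimensionality

# The embedding matrix: one row per word, each row a vector in R^n.
E = rng.standard_normal((len(vocab), n))

def embed(word):
    """Look up the distributed representation of a word."""
    return E[vocab[word]]
```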