Abstract | We propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings (compact vector representations for phrases), which can distinguish phrases with different semantic meanings. |
Bilingually-constrained Recursive Auto-encoders | 3.1.1 Word Vector Representations |
Bilingually-constrained Recursive Auto-encoders | In phrase embedding using composition, the word vector representation is the basis and serves as the input to the neural network. |
Bilingually-constrained Recursive Auto-encoders | Given a phrase which is an ordered list of m words, each word has an index i into the columns of the embedding matrix L. The index i is used to retrieve the word’s vector representation using a simple multiplication with a binary vector b which is zero in all positions except for the ith index: |
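A minimal numpy sketch of this lookup; the matrix sizes and values are illustrative assumptions, and the binary vector is written b as in the sentence above:

    import numpy as np

    # Illustrative sizes: n-dimensional embeddings for a small vocabulary.
    n, vocab_size = 4, 10
    L = np.random.randn(n, vocab_size)   # embedding matrix, one column per word

    def lookup(i):
        """Retrieve word i's vector as L times a binary (one-hot) vector b."""
        b = np.zeros(vocab_size)
        b[i] = 1.0
        return L @ b                      # equivalent to L[:, i]

    x_i = lookup(3)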
Introduction | embedding, which converts a word into a dense, low dimensional, real-valued vector representation (Bengio et al., 2003; Bengio et al., 2006; Collobert and Weston, 2008; Mikolov et al., 2013). |
Introduction | Therefore, in order to successfully apply DNNs to model the whole translation process, such as modelling the decoding process, learning compact vector representations for the basic phrasal translation units is essential and fundamental.
Related Work | In contrast, our method attempts to learn the semantic vector representation for any phrase. |
Conclusion | We have presented a novel method for adapting the vector representations of words according to their context. |
Introduction | Second, the vectors of two syntactically related words, e.g., a target verb acquire and its direct object knowledge, typically have different syntactic environments, which implies that their vector representations encode complementary information and there is no direct way of combining the information encoded in the respective vectors.
Introduction | To solve these problems, we build upon previous work (Thater et al., 2009) and propose to use syntactic second-order vector representations.
Introduction | Second-order vector representations in a bag-of-words setting were first used by Schütze (1998); in a syntactic setting, they also feature in Dligach and Palmer (2008).
Related Work | Several approaches to contextualize vector representations of word meaning have been proposed. |
Related Work | By using vector representations of a predicate p and an argument a, Kintsch identifies words |
Related Work | Mitchell and Lapata (2008), henceforth M&L, propose a general framework in which meaning representations for complex expressions are computed compositionally by combining the vector representations of the individual words of the complex expression. |
The model | In this section, we present our method of contextualizing semantic vector representations.
The model | Our model employs vector representations for words and expressions containing syntax-specific first- and second-order co-occurrence information.
The model | Co-occurrence graphs form the basis for the construction of both kinds of vector representations.
Abstract | Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations.
Introduction | Previous RNN-based parsers used the same (tied) weights at all nodes to compute the vector representing a constituent (Socher et al., 2011b). |
Introduction | Therefore we combine syntactic and semantic information by giving the parser access to rich syntactico-semantic information in the form of distributional word vectors and compute compositional semantic vector representations for longer phrases (Costa et al., 2003; Menchetti et al., 2005; Socher et al., 2011b). |
Introduction | We will first briefly introduce single word vector representations and then describe the CVG objective function, tree scoring and inference. |
Abstract | We introduce the problem of generation in distributional semantics: Given a distributional vector representing some meaning, how can we generate the phrase that best expresses that meaning? |
Evaluation setting | Construction of vector spaces We test two types of vector representations.
Evaluation setting | (2013a) learns vector representations using a neural network architecture by trying to predict a target word given the words surrounding it. |
Evaluation setting | (2014) for an extensive comparison of the two types of vector representations.
General framework | To construct the vector representing a two-word phrase, we must compose the vectors associated to the input words. |
General framework | where u⃗ and v⃗ are the vector representations associated to words u and v. f_comp_R : R^d × R^d → R^d (for d the dimensionality of vectors) is a composition function specific to the syntactic relation R holding between the two words.
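As an illustration only, a relation-specific composition function of type R^d × R^d → R^d could be instantiated as below; the additive form and its weight matrices are assumptions for the sketch, not the paper's specific choice:

    import numpy as np

    d = 50
    # One pair of weight matrices per syntactic relation R (assumed additive form).
    relations = {"amod": (np.random.randn(d, d), np.random.randn(d, d))}

    def compose(u, v, R):
        """f_comp_R(u, v) = A_R u + B_R v, a relation-specific composition."""
        A, B = relations[R]
        return A @ u + B @ v

    red, car = np.random.randn(d), np.random.randn(d)
    red_car = compose(red, car, "amod")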
Introduction | For example, given the vectors representing red and car, composition derives a vector that approximates the meaning of red car. |
Introduction | We can, for example, synthesize the vector representing the meaning of a phrase or sentence, and then generate alternative phrases or sentences from this vector to accomplish true paraphrase generation (as opposed to paraphrase detection or ranking of candidate paraphrases). |
Introduction | Given a vector representing an image, generation can be used to productively construct phrases or sentences that describe the image (as opposed to simply retrieving an existing description from a set of candidates). |
Experimental Setup | To add auxiliary word vector representations, we use the publicly available word vectors (Cirik
Introduction | Even in the case of first-order parsers, this results in a high-dimensional vector representation of each arc. |
Introduction | participating in an arc, such as continuous vector representations of words. |
Introduction | Finally, we demonstrate that the model can successfully leverage word vector representations, in contrast to the baselines.
Problem Formulation | Specifically, Uφ_h (for a given sentence, suppressed) is an r-dimensional vector representation of the word corresponding to h as a head word.
Related Work | Traditionally, these vector representations have been derived primarily from co-occurrences of words within sentences, ignoring syntactic roles of the co-occurring words. |
Related Work | While this method learns to map word combinations into vectors, it builds on existing word-level vector representations . |
Experiments | Given a query word w and another word w', we obtain their vector representations φ_w and φ_w', and evaluate their cosine similarity as S(φ_w, φ_w') = (φ_w · φ_w') / (||φ_w|| ||φ_w'||). By assessing the similarity of w with all other words w', we can find the words deemed most similar by the model.
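A small sketch of this query procedure; the toy vocabulary and the random embedding matrix are stand-ins for a trained model:

    import numpy as np

    vocab = ["good", "great", "bad", "terrible", "movie"]
    Phi = np.random.randn(len(vocab), 50)          # one row phi_w per word (stand-in values)

    def most_similar(word, k=3):
        """Rank all other words by cosine similarity to the query word's vector."""
        q = Phi[vocab.index(word)]
        sims = Phi @ q / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(q))
        order = [i for i in np.argsort(-sims) if vocab[i] != word]
        return [(vocab[i], float(sims[i])) for i in order[:k]]

    print(most_similar("good"))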
Introduction | This component of the model uses the vector representation of words to predict the sentiment annotations on contexts in which the words appear. |
Introduction | This causes words expressing similar sentiment to have similar vector representations.
Our Model | The energy function uses a word representation matrix R ∈ R^(β × |V|) where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = Rw corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, θ ∈ R^β, which weights each of the β dimensions of words’ representation vectors.
Related work | For each latent topic T, the model learns a conditional distribution p(w|T) for the probability that word w occurs in T. One can obtain a k-dimensional vector representation of words by first training a k-topic model and then filling the matrix with the p(w|T) values (normalized to unit length).
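A sketch of turning a k-topic model into k-dimensional word vectors, assuming a precomputed p(w|T) matrix; random Dirichlet draws stand in for trained topic-word probabilities:

    import numpy as np

    k, vocab_size = 8, 1000
    # Stand-in for a trained topic model's p(w|T): one column per topic T.
    p_w_given_T = np.random.dirichlet(np.ones(vocab_size), size=k).T   # shape (|V|, k)

    # Each word's k-dimensional vector is its row of p(w|T) values,
    # normalized to unit length as described above.
    word_vectors = p_w_given_T / np.linalg.norm(p_w_given_T, axis=1, keepdims=True)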
Experimental setup | Annotation of quality of test vectors The quality of the corpus-based vectors representing derived test items was determined by collecting human semantic similarity judgments in a crowdsourcing survey. |
Experimental setup | The first experiment investigates to what extent composition models can approximate high-quality (HQ) corpus-extracted vectors representing derived forms. |
Experimental setup | Lexfunc provides a flexible way to account for affixation, since it models affixation directly as a function mapping from and onto word vectors, without requiring a vector representation of bound affixes.
Related work | Although these works exploit vectors representing complex forms, they do not attempt to generate them compositionally. |
Our Approach | The computation process is conducted in a bottom-up manner, and the vector representations are computed recursively. |
RNN: Recursive Neural Network | It performs compositions based on the binary trees and obtains the vector representations in a bottom-up way.
RNN: Recursive Neural Network | The vector representation v is obtained via: |
RNN: Recursive Neural Network | The vector representation of the root node is then fed into a softmax classifier to predict the label.
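A minimal sketch of this bottom-up composition and root classification, assuming the standard tanh composition with a single weight matrix; sizes and parameter values are illustrative:

    import numpy as np

    d, n_labels = 10, 2
    W = np.random.randn(d, 2 * d)          # composition weights
    b = np.zeros(d)
    Ws = np.random.randn(n_labels, d)      # softmax classifier weights

    def compose(c1, c2):
        """Parent vector v = tanh(W [c1; c2] + b), applied bottom-up over a binary tree."""
        return np.tanh(W @ np.concatenate([c1, c2]) + b)

    def predict(root):
        scores = Ws @ root
        return np.exp(scores) / np.exp(scores).sum()   # softmax over labels

    leaf1, leaf2, leaf3 = (np.random.randn(d) for _ in range(3))
    root = compose(compose(leaf1, leaf2), leaf3)
    probs = predict(root)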
Model overview | Our framework accommodates any paraphrasing method, and in this paper we propose an association model that learns to associate natural language phrases that co-occur frequently in a monolingual parallel corpus, combined with a vector space model, which learns to score the similarity between vector representations of natural language utterances (Section 5). |
Paraphrasing | We now introduce a vector space (VS) model, which assigns a vector representation for each utterance, and learns a scoring function that ranks paraphrase candidates. |
Paraphrasing | We start by constructing vector representations of words. |
Paraphrasing | We can now estimate a paraphrase score for two utterances x and c via a weighted combination of the components of the vector representations:
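For illustration only, the scoring step can be read as a weighted combination of features built from the two utterance vectors; the specific features used below (elementwise product and absolute difference) and the random weights are assumptions of this sketch, not the paper's definition:

    import numpy as np

    d = 50
    theta = np.random.randn(2 * d)         # learned weights (stand-in values)

    def paraphrase_score(vx, vc):
        """Weighted combination of components derived from the two utterance vectors."""
        features = np.concatenate([vx * vc, np.abs(vx - vc)])
        return float(theta @ features)

    vx, vc = np.random.randn(d), np.random.randn(d)
    print(paraphrase_score(vx, vc))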
Autoencoders for Grounded Semantics | The target vector is the sum of x^(o) and the centroid x̄^(o) of the remaining attribute vectors representing object o.
Experimental Setup | As shown in Figure 1, our model takes as input two (real-valued) vectors representing the visual and textual modalities. |
Experimental Setup | correspond to words and edges to cosine similarity scores between vectors representing their meaning.
Results | Table 6 shows examples of clusters produced by Chinese Whispers when using vector representations provided by the SAE model. |
Compositional distributional semantics | If distributional vectors encode certain aspects of word meaning, it is natural to expect that similar aspects of sentence meaning can also receive vector representations, obtained compositionally from word vectors.
Compositional distributional semantics | Deverbal nouns like demolition, often used without mention of who demolished what, would have to get vector representations while the corresponding verbs (demolish) would become tensors, which makes immediately related verbs and nouns incomparable. |
The practical lexical function model | The matrices formalize argument slot saturation, operating on an argument vector representation through matrix-by-vector multiplication, as described in the next section.
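An illustrative instance of this operation; the sizes, the specific verb matrix, and the way the result is combined with the predicate's own vector (simple addition here) are assumptions of the sketch:

    import numpy as np

    d = 50
    chase_vec = np.random.randn(d)         # the verb's own vector
    chase_obj = np.random.randn(d, d)      # matrix for the verb's object slot (assumed)
    dogs = np.random.randn(d)              # argument vector

    # Saturating the object slot: matrix-by-vector multiplication, then combining
    # with the predicate's own vector (addition is an assumption of this sketch).
    chase_dogs = chase_vec + chase_obj @ dogs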
The practical lexical function model | This flexibility makes our model suitable to compute vector representations of sentences without stumbling at unseen syntactic usages of words. |
Introduction | In this paper, we rely on the same image analysis techniques but instead focus on the reference problem: We do not aim at enriching word representations with visual information, although this might be a side effect of our approach, but we address the issue of automatically mapping objects, as depicted in images, to the context vectors representing the corresponding words. |
Introduction | We show that the induced cross-modal semantic space is powerful enough that sensible guesses about the correct word denoting an object can be made, even when the linguistic context vector representing the word has been created from as little as 1 sentence containing it. |
Introduction | First, we conduct experiments with simple image- and text-based vector representations and compare alternative methods to perform cross-modal mapping.
Related Work | (2013) use linear regression to transform vector-based image representations onto vectors representing the same concepts in linguistic semantic space. |
Extraction from Documents and Queries | 2) construction of internal search-signature vector representations for each candidate attribute, based |
Extraction from Documents and Queries | 3) construction of a reference internal search-signature vector representation for a small set of seed attributes provided as input. |
Extraction from Documents and Queries | 4) ranking of candidate attributes with respect to each class (e.g., movies), by computing similarity scores between their individual vector representations and the reference vector of the seed attributes. |
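A sketch of steps 3 and 4 above, assuming each candidate attribute already has a search-signature vector; random vectors stand in for real signatures, and the centroid used for the reference vector is an assumption of the sketch:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    dim = 100
    candidates = {"director": np.random.rand(dim), "cast": np.random.rand(dim),
                  "runtime": np.random.rand(dim), "banana": np.random.rand(dim)}
    seeds = {"genre": np.random.rand(dim), "release date": np.random.rand(dim)}

    # Step 3: reference vector built from the seed attributes' signature vectors.
    reference = np.mean(list(seeds.values()), axis=0)

    # Step 4: rank candidates by similarity of their vectors to the reference vector.
    ranking = sorted(candidates, key=lambda a: cosine(candidates[a], reference), reverse=True)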
Distributional Semantic Hidden Markov Models | Unlike in most applications of HMMs in text processing, in which the representation of a token is simply its word or lemma identity, tokens in DSHMM are also associated with a vector representation of their meaning in context according to a distributional semantic model (Section 3.1). |
Distributional Semantic Hidden Markov Models | All the methods below start from this basic vector representation.
Distributional Semantic Hidden Markov Models | Let event head h be the syntactic head of a number of arguments a_1, a_2, ..., a_m, and v⃗_h, v⃗_a1, v⃗_a2, ..., v⃗_am be their respective vector representations according to the SIMPLE method.
Integrating Semantic Constraint into Surprisal | The factor A(w_n, h) is essentially based on a comparison between the vector representing the current word w_n and the vector representing the prior history h. Varying the method for constructing word vectors (e.g., using LDA or a simpler semantic space model) and for combining them into a representation of the prior context h (e.g., using additive or multiplicative functions) produces distinct models of semantic composition.
Integrating Semantic Constraint into Surprisal | The calculation of A is then based on a weighted dot product of the vector representing the upcoming word w, with the vector representing the prior context h: |
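A sketch of this kind of factor; the additive construction of the context vector h and the random per-dimension weights are assumptions for illustration, not the paper's exact weighting scheme:

    import numpy as np

    d = 100
    weights = np.random.rand(d)            # stand-in for the per-dimension weights

    def semantic_factor(w_vec, context_vecs):
        """Weighted dot product of the upcoming word's vector with the prior-context
        vector h, where h is (illustratively) the sum of the preceding word vectors."""
        h = np.sum(context_vecs, axis=0)
        return float(np.sum(weights * w_vec * h))

    w_vec = np.random.rand(d)
    context = [np.random.rand(d) for _ in range(5)]
    A = semantic_factor(w_vec, context)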
Models of Processing Difficulty | In this framework, the similarity between two words can be easily quantified, e.g., by measuring the cosine of the angle of the vectors representing them. |
Models of Processing Difficulty | Specifically, the simpler space is based on word co-occurrence counts; it constructs the vector representing a given target word, t, by identifying all the tokens of t in a corpus and recording the counts of context words, c (within a specific window).
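A minimal sketch of this simpler space; the toy corpus and window size are assumptions:

    from collections import Counter

    corpus = "the dog chased the cat the dog barked".split()
    window = 2

    def target_vector(t):
        """Count context words c within +/-window of every token of t in the corpus."""
        counts = Counter()
        for i, tok in enumerate(corpus):
            if tok == t:
                for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
                    if j != i:
                        counts[corpus[j]] += 1
        return counts

    print(target_vector("dog"))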
Experimental Approach | Generating Visual Representations Visual vector representations for each image were obtained using the well-known bag of visual words (BoVW) approach (Sivic and Zisserman, 2003). |
Experimental Approach | BoVW obtains a vector representation for an image.
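A sketch of a BoVW pipeline under simplified assumptions: random local descriptors stand in for SIFT-style features, and scikit-learn's KMeans builds the visual vocabulary (the cluster count and sizes are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in local descriptors pooled from a collection of images (SIFT would give 128-dim).
    rng = np.random.default_rng(0)
    all_descriptors = rng.random((1000, 128))
    kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(all_descriptors)

    def bovw_vector(image_descriptors):
        """Histogram of visual-word assignments = the image's BoVW vector."""
        words = kmeans.predict(image_descriptors)
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        return hist / hist.sum()

    image_vec = bovw_vector(rng.random((200, 128)))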
Experimental Approach | Generating Linguistic Representations We extract continuous vector representations (also of 50 dimensions) for concepts using the continuous log-linear skipgram model of Mikolov et al. |
Introduction | In order to isolate the contribution from word embeddings, it is useful to demonstrate improvement over a parser that already achieves state-of-the-art performance without vector representations.
Parser extensions | φ(w) is the vector representation of the word w, α_m are per-basis weights, and β is an inverse radius parameter which determines the strength of the smoothing.
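A sketch of this style of embedding-based kernel smoothing; the Gaussian kernel form, the symbols, and the basis set are assumptions for illustration:

    import numpy as np

    def smoothed_weight(phi_w, basis_vectors, alphas, beta):
        """Combine per-basis weights alpha_m with a Gaussian kernel whose width is
        controlled by the inverse radius beta: larger beta means less smoothing."""
        dists = np.array([np.sum((phi_w - phi_m) ** 2) for phi_m in basis_vectors])
        return float(np.sum(alphas * np.exp(-beta * dists)))

    d, M = 50, 5
    basis = [np.random.randn(d) for _ in range(M)]
    alphas = np.random.rand(M)
    value = smoothed_weight(np.random.randn(d), basis, alphas, beta=0.1)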
Parser extensions | vector representation.
Background | The recursive application of autoencoders was first introduced in Pollack (1990), whose recursive auto-associative memories learn vector representations over pre-specified recursive data structures. |
Learning | The unsupervised method described so far learns a vector representation for each sentence. |
Model | Their purpose is to learn semantically meaningful vector representations for sentences and phrases of variable size, while the purpose of this paper is to investigate the use of syntax and linguistic formalisms in such vector-based compositional models. |
Experiments | LinLearn denotes model combination by overloading the vector representation of queries q and documents d in the VW linear learner by incorporating arbitrary ranking models as dense features.
Model Combination | This means that the vector representation of queries q and documents d in the VW linear learner is overloaded once more: In addition to dense domain-knowledge features, we incorporate arbitrary ranking models as dense features whose value is the score of the ranking model.
Translation and Ranking for CLIR | Optimization for these additional models including domain knowledge features was done by overloading the vector representation of queries q and documents d in the VW linear learner: Instead of sparse word-based features, q and d are represented by real-valued vectors of dense domain-knowledge features.
Introduction | Finally, we perform SVD on the motif similarity matrix (with size of the order of the total vocabulary in the corpus), and retain the first k principal eigenvectors to obtain low-dimensional vector representations that are more convenient to work with. |
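A sketch of this dimensionality reduction step, with a small random symmetric matrix standing in for the motif similarity matrix and k chosen arbitrarily:

    import numpy as np

    n, k = 200, 20
    A = np.random.rand(n, n)
    S = (A + A.T) / 2                      # stand-in for the motif similarity matrix

    # SVD; keeping the first k singular vectors gives k-dimensional motif vectors.
    U, singular_values, _ = np.linalg.svd(S)
    motif_vectors = U[:, :k] * singular_values[:k]   # rows are low-dimensional representations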
Introduction | For composing the motif representations to get judgments on semantic similarity of sentences, we use our recent Vector Tree Kernel approach. The VTK approach defines a convolutional kernel over graphs defined by the dependency parses of sentences, using a vector representation at each graph node representing a single lexical token.
Introduction | For this task, we again use the VTK formalism for combining vector representations of the individual motifs. |
Introduction | We tabulate the transitions of entities between different syntactic positions (or their nonoccurrence) in sentences, and convert the frequencies of transitions into a feature vector representation of transition probabilities in the document. |
Introduction | We solve this problem in a supervised machine learning setting, where the input is the feature vector representations of the two versions of the document, and the output is a binary value indicating the document with the original sentence ordering. |
Introduction | Transition length — the maximum length of the transitions used in the feature vector representation of a document. |
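A sketch of converting entity role sequences into a transition-probability feature vector; the role inventory and the maximum transition length of 2 are illustrative assumptions:

    from collections import Counter
    from itertools import product

    ROLES = ["S", "O", "X", "-"]           # subject, object, other, absent

    def transition_features(grid, length=2):
        """grid: one role sequence per entity across the document's sentences.
        Returns probabilities of every possible transition of the given length."""
        counts = Counter()
        total = 0
        for roles in grid:
            for i in range(len(roles) - length + 1):
                counts[tuple(roles[i:i + length])] += 1
                total += 1
        return {t: counts[t] / total for t in product(ROLES, repeat=length)}

    grid = [["S", "O", "-"], ["-", "X", "S"]]
    features = transition_features(grid)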