Modeling Sentences in the Latent Space
Guo, Weiwei and Diab, Mona

Article Structure

Abstract

Sentence Similarity is the process of computing a similarity score between two sentences.

Introduction

Identifying the degree of semantic similarity [SS] between two sentences is at the core of many NLP applications that focus on sentence-level semantics, such as Machine Translation (Kauchak and Barzilay, 2006), Summarization (Zhou et al., 2006), and Text Coherence Detection (Lapata and Barzilay, 2005). To date, almost all SS approaches work in the high-dimensional word space and rely mainly on word similarity.

Limitations of Topic Models and LSA for Modeling Sentences

Usually latent variable models aim to find a latent semantic profile for a sentence that is most relevant to the observed words.

The Proposed Approach

3.1 Weighted Matrix Factorization

Evaluation for SS

We need to show the impact of our proposed model WTMF on the SS task.

Experiments and Results

We evaluate WTMF on three data sets: 1.

Related Work

Almost all current SS methods work in the high-dimensional word space and rely heavily on word/sense similarity measures, which are knowledge-based (Li et al., 2006; Feng et al., 2008; Ho et al., 2010; Tsatsaronis et al., 2010), corpus-based (Islam

Conclusions

We explicitly model missing words to alleviate the sparsity problem in modeling short texts.

Topics

LDA

Appears in 26 sentences as: LDA (29)
In Modeling Sentences in the Latent Space
  1. Latent variable models, such as Latent Semantic Analysis [LSA] (Landauer et al., 1998), Probabilistic Latent Semantic Analysis [PLSA] (Hofmann, 1999), Latent Dirichlet Allocation [LDA] (Blei et al., 2003) can solve the two issues naturally by modeling the semantics of words and sentences simultaneously in the low-dimensional latent space.
    Page 1, “Introduction”
  2. Therefore, PLSA finds a topic distribution for each concept definition that maximizes the log likelihood of the corpus X (LDA has a similar form):
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  3. The performance of WTMF on CDR is compared with (a) an Information Retrieval model (IR) that is based on surface word matching, (b) an n-gram model (N-gram) that captures phrase overlaps by returning the number of overlapping n-grams as the similarity score of two sentences, (c) LSA that uses the svds() function in Matlab, and (d) LDA that uses Gibbs Sampling for inference (Griffiths and Steyvers, 2004).
    Page 5, “Experiments and Results”
  4. WTMF is also compared with all existing reported SS results on the LI06 and MSR04 data sets, as well as LDA that is trained on the same data as WTMF.
    Page 5, “Experiments and Results”
  5. To eliminate randomness in statistical models (WTMF and LDA), all the reported results are averaged over 10 runs.
    Page 5, “Experiments and Results”
  6. We run 5000 iterations for LDA; each LDA model is averaged over the last 10 Gibbs Sampling iterations to get more robust predictions.
    Page 5, “Experiments and Results”
  7. The latent vector of a sentence is computed by: (1) using equation 4 in WTMF, or (2) summing up the latent vectors of all the constituent words weighted by Xij in LSA and LDA, similar to the work reported in (Mihalcea et al., 2006).
    Page 5, “Experiments and Results”
  8. For LDA, the latent vector of a word is computed by P(z|w) (see the sketch after this list).
    Page 5, “Experiments and Results”
  9. All the latent variable models (LSA, LDA, WTMF) are built on the same corpus: WN+Wik+Brown (393,666 sentences and 4,262,026 words).
    Page 6, “Experiments and Results”
  10. We mainly compare the performance of IR, N-gram, LSA, LDA, and WTMF models.
    Page 6, “Experiments and Results”
  11. In LDA, we choose an optimal combination of α and β from {0.01, 0.05, 0.1, 0.5}. In WTMF, we choose the best parameters of the weight wm for missing words and λ for regularization.
    Page 6, “Experiments and Results”
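
Items 7-8 above describe how the baselines derive sentence vectors: a sentence's latent vector is the sum of its words' latent vectors weighted by the TF-IDF values Xij, and in LDA a word's vector is P(z|w). A minimal numpy sketch of that construction, assuming a trained topic-word distribution P(w|z) and global topic proportions P(z) are available (the variable names are illustrative, not the paper's):

    import numpy as np

    def p_z_given_w(phi, p_z):
        """P(z|w) for every word via Bayes' rule: P(z|w) is proportional to P(w|z) P(z).

        phi : (K, V) topic-word distributions P(w|z) from a trained LDA model
        p_z : (K,)   global topic proportions P(z)
        """
        joint = phi * p_z[:, None]        # (K, V), proportional to P(w, z)
        return joint / joint.sum(axis=0)  # normalize over topics per word column

    def sentence_vector(word_ids, tfidf, word_topics):
        """Sentence latent vector: TF-IDF-weighted sum of its words' P(z|w) columns."""
        vec = np.zeros(word_topics.shape[0])
        for w, x in zip(word_ids, tfidf):
            vec += x * word_topics[:, w]
        return vec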

latent semantics

Appears in 12 sentences as: Latent Semantic (2) latent semantic (1) latent semantics (12)
In Modeling Sentences in the Latent Space
  1. Previous sentence similarity work finds that latent semantics approaches to the problem do not perform well due to insufficient information in single sentences.
    Page 1, “Abstract”
  2. Latent variable models, such as Latent Semantic Analysis [LSA] (Landauer et al., 1998), Probabilistic Latent Semantic Analysis [PLSA] (Hofmann, 1999), Latent Dirichlet Allocation [LDA] (Blei et al., 2003) can solve the two issues naturally by modeling the semantics of words and sentences simultaneously in the low-dimensional latent space.
    Page 1, “Introduction”
  3. We believe that the latent semantics approaches applied to date to the SS problem have not yielded positive results due to deficient modeling of the sparsity in the semantic space: SS operates in a very limited contextual setting where the sentences are typically too short to derive robust latent semantics.
    Page 1, “Introduction”
  4. Apart from the SS setting, robust modeling of the latent semantics of short sentences/texts is becoming a pressing need due to the pervasive presence of more bursty data sets such as Twitter feeds and SMS where short contexts are an inherent characteristic of the data.
    Page 1, “Introduction”
  5. Usually latent variable models aim to find a latent semantic profile for a sentence that is most relevant to the observed words.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  6. By explicitly modeling missing words, we set another criterion to the latent semantics profile: it should not be related to the missing words in the sentence.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  7. It would be desirable if topic models could exploit missing words (a lot more data than observed words) to render more nuanced latent semantics, so that pairs of sentences in the same domain can be differentiated.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  8. The three latent semantics profiles in table 1 illustrate our analysis for topic models and LSA.
    Page 3, “Limitations of Topic Models and LSA for Modeling Sentences”
  9. Accordingly, P·,i is a K-dimension latent semantics vector profile for word wi; similarly, Q·,j is the K-dimension vector profile that represents the sentence sj.
    Page 3, “The Proposed Approach”
  10. This solution is quite elegant: 1. it explicitly tells the model that, in general, all missing words should not be related to the sentence; 2. meanwhile, latent semantics are mainly generalized based on observed words, and the model is not penalized too much (wm is very small) when it is very confident that the sentence is highly related to a small subset of missing words based on their latent semantics profiles (the bank#n#1 definition sentence is related to its missing words check and loan); the resulting weighting scheme is written out after this list.
    Page 4, “The Proposed Approach”
  11. This is because LDA only uses 10 observed words to infer a 100-dimension vector for a sentence, while WTMF takes advantage of the far more numerous missing words to learn more robust latent semantics vectors.
    Page 6, “Experiments and Results”
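
Item 10's two criteria are exactly what the weight matrix in the weighted matrix factorization formulation encodes: observed words carry full weight, missing words the small weight wm. In the paper's notation this weighting scheme can be written as:

\[
W_{ij} =
\begin{cases}
1 & \text{if } X_{ij} \neq 0 \\
w_m & \text{otherwise}
\end{cases}
\]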

latent variable

Appears in 7 sentences as: Latent variable (1) latent variable (6)
In Modeling Sentences in the Latent Space
  1. In this paper, we show that by carefully handling words that are not in the sentences (missing words), we can train a reliable latent variable model on sentences.
    Page 1, “Abstract”
  2. Experiments on the new task and previous data sets show significant improvement of our model over baselines and other traditional latent variable models.
    Page 1, “Abstract”
  3. Latent variable models, such as Latent Semantic Analysis [LSA] (Landauer et al., 1998), Probabilistic Latent Semantic Analysis [PLSA] (Hofmann, 1999), Latent Dirichlet Allocation [LDA] (Blei et al., 2003) can solve the two issues naturally by modeling the semantics of words and sentences simultaneously in the low-dimensional latent space.
    Page 1, “Introduction”
  4. After analyzing the way traditional latent variable models (LSA, PLSA/LDA) handle missing words, we decide to model sentences using a weighted matrix factorization approach (Srebro and Jaakkola, 2003), which allows us to treat observed words and missing words differently.
    Page 2, “Introduction”
  5. Usually latent variable models aim to find a latent semantic profile for a sentence that is most relevant to the observed words.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  6. All the latent variable models (LSA, LDA, WTMF) are built on the same corpus: WN+Wik+Brown (393,666 sentences and 4,262,026 words).
    Page 6, “Experiments and Results”
  7. In these latent variable models, there are several essential parameters: the weight of missing words wm, and the dimension K. Figures 2 and 3 analyze the impact of these parameters on ATOPtest.
    Page 7, “Experiments and Results”

topic models

Appears in 6 sentences as: Topic models (1) topic models (5)
In Modeling Sentences in the Latent Space
  1. Topic models (PLSA/LDA) do not explicitly model missing words.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  2. However, empirical results show that given a small number of observed words, usually topic models can only find one topic (the most evident topic) for a sentence, e.g., the concept definitions of bank#n#1 and stock#n#1 are assigned the financial topic only, without any further discernibility.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  3. The reason is that topic models try to learn a 100-dimension latent vector (assuming dimension K = 100) from very few data points (10 observed words on average).
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  4. It would be desirable if topic models could exploit missing words (a lot more data than observed words) to render more nuanced latent semantics, so that pairs of sentences in the same domain can be differentiated.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  5. The three latent semantics profiles in table 1 illustrate our analysis for topic models and LSA.
    Page 3, “Limitations of Topic Models and LSA for Modeling Sentences”
  6. The first latent vector (generated by topic models) is chosen by maximizing R_obs = 600.
    Page 3, “Limitations of Topic Models and LSA for Modeling Sentences”

co-occurrence

Appears in 5 sentences as: co-occurrence (5)
In Modeling Sentences in the Latent Space
  1. 2. word co-occurrence information is not sufficiently exploited.
    Page 1, “Introduction”
  2. LSA and PLSA/LDA work on a word-sentence co-occurrence matrix.
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  3. The resulting M × N co-occurrence matrix X comprises the TF-IDF values in each cell Xij, namely the TF-IDF value of word wi in sentence sj (a sketch of this construction follows the list).
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  4. After the SS model learns the co-occurrence of words from WN definitions, in the testing phase, given an ON definition d, the SS algorithm needs to identify the equivalent WN definitions by computing the similarity values between all WN definitions and the ON definition d, then sorting the values in decreasing order.
    Page 5, “Evaluation for SS”
  5. For the Brown corpus, each sentence is treated as a document in order to create more coherent co-occurrence values.
    Page 5, “Experiments and Results”
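
Item 3 describes the model input: an M × N word-sentence matrix of TF-IDF values. A minimal sketch of building such a matrix with scikit-learn (the library choice is ours; the excerpts do not specify an implementation):

    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = [
        "a financial institution that accepts deposits",  # toy definitions
        "the capital raised by a corporation through the issue of shares",
    ]

    # TfidfVectorizer yields a (sentence x vocabulary) TF-IDF matrix;
    # transposing gives the M x N word-sentence co-occurrence matrix X.
    X = TfidfVectorizer().fit_transform(sentences).T
    print(X.shape)  # (M words, N sentences)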

sentence pairs

Appears in 5 sentences as: sentence pair (2) sentence pairs (3)
In Modeling Sentences in the Latent Space
  1. The two data sets we know of for SS are: 1. the human-rated sentence pair similarity data set (Li et al., 2006) [LI06]; 2. the Microsoft Research Paraphrase Corpus (Dolan et al., 2004) [MSR04].
    Page 4, “Evaluation for SS”
  2. On the other hand, the MSR04 data set comprises a much larger set of sentence pairs: 4,076 training and 1,725 test pairs.
    Page 4, “Evaluation for SS”
  3. This is not a problem per se; however, the issue is that it is very strict in its assignment of a positive label. For example, the following sentence pair, as cited in (Islam and Inkpen, 2008), is rated not semantically similar: Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.
    Page 4, “Evaluation for SS”
  4. Note that r and ρ are much lower for the 35 pairs set, since most of the sentence pairs have a very low similarity (the average similarity value is 0.065 in the 35 pairs set and 0.367 in the 30 pairs set) and SS models need to identify the tiny differences among them, thereby rendering this set much harder to predict (see the metric sketch after this list).
    Page 7, “Experiments and Results”
  5. We use the same parameter setting used for the LI06 evaluation since both sets are human-rated sentence pairs (λ = 20, wm = 0.01, K = 100).
    Page 8, “Experiments and Results”
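
The r and ρ in item 4 are Pearson's and Spearman's correlations between model scores and human ratings, the standard metrics on these human-rated sets. A minimal scipy sketch (the numbers are illustrative, not from the paper):

    from scipy.stats import pearsonr, spearmanr

    gold = [0.1, 0.4, 0.9, 0.2, 0.7]  # human similarity ratings (toy values)
    pred = [0.2, 0.3, 0.8, 0.1, 0.6]  # model similarity scores (toy values)

    r, _ = pearsonr(gold, pred)       # linear correlation
    rho, _ = spearmanr(gold, pred)    # rank correlation
    print(f"r = {r:.3f}, rho = {rho:.3f}")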

similarity score

Appears in 5 sentences as: similarity score (3) similarity scores (2)
In Modeling Sentences in the Latent Space
  1. Sentence Similarity is the process of computing a similarity score between two sentences.
    Page 1, “Abstract”
  2. A subset of 30 pairs is further selected by LI06 to render the similarity scores evenly distributed.
    Page 4, “Evaluation for SS”
  3. The performance of WTMF on CDR is compared with (a) an Information Retrieval model (IR) that is based on surface word matching, (b) an n-gram model (N-gram) that captures phrase overlaps by returning the number of overlapping n-grams as the similarity score of two sentences, (c) LSA that uses the svds() function in Matlab, and (d) LDA that uses Gibbs Sampling for inference (Griffiths and Steyvers, 2004).
    Page 5, “Experiments and Results”
  4. Using a smaller wm means the similarity score is computed mainly from semantics of the observed words.
    Page 7, “Experiments and Results”
  5. This benefits CDR, since it gives more accurate similarity scores for those similar pairs, but not so accurate for dissimilar pairs.
    Page 7, “Experiments and Results”

N-gram

Appears in 4 sentences as: N-gram (5) n-gram (1)
In Modeling Sentences in the Latent Space
  1. The performance of WTMF on CDR is compared with (a) an Information Retrieval model (IR) that is based on surface word matching, (b) an n-gram model (N-gram) that captures phrase overlaps by returning the number of overlapping n-grams as the similarity score of two sentences, (c) LSA that uses the svds() function in Matlab, and (d) LDA that uses Gibbs Sampling for inference (Griffiths and Steyvers, 2004).
    Page 5, “Experiments and Results”
  2. The similarity of two sentences is computed by cosine similarity (except N-gram); see the sketch after this list.
    Page 5, “Experiments and Results”
  3. We mainly compare the performance of IR, N-gram, LSA, LDA, and WTMF models.
    Page 6, “Experiments and Results”
  4. The IR model that works in word space achieves better ATOP scores than N-gram, although the N-gram idea is commonly used in detecting paraphrases as well as in machine translation.
    Page 6, “Experiments and Results”
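
Items 1-2 name the two similarity functions used across the experiments: the N-gram baseline returns the number of overlapping n-grams, and the other models use the cosine between sentence vectors. A minimal sketch of both; the choice of unigrams through trigrams is our assumption, as the excerpts do not state the n-gram orders:

    import numpy as np

    def ngram_overlap(tokens_a, tokens_b, max_n=3):
        """N-gram baseline: count of n-grams shared by the two sentences."""
        def ngrams(tokens, n):
            return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        return sum(len(ngrams(tokens_a, n) & ngrams(tokens_b, n))
                   for n in range(1, max_n + 1))

    def cosine(u, v):
        """Similarity of two sentence latent vectors (LSA/LDA/WTMF)."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))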

objective function

Appears in 3 sentences as: objective function (2) objective function: (1)
In Modeling Sentences in the Latent Space
  1. In effect, LSA allows missing and observed words to equally impact the objective function.
    Page 3, “Limitations of Topic Models and LSA for Modeling Sentences”
  2. Moreover, the true semantics of the concept definitions is actually related to some missing words, but such true semantics will not be favored by the objective function, since equation 2 allows for too strong an impact by Xij = 0 for any missing word.
    Page 3, “Limitations of Topic Models and LSA for Modeling Sentences”
  3. The model parameters (vectors in P and Q) are optimized by minimizing the objective function (reconstructed below):
    Page 3, “The Proposed Approach”
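
The excerpt in item 3 cuts off before the equation itself. For orientation, a reconstruction in the paper's notation, following the weighted matrix factorization formulation of (Srebro and Jaakkola, 2003): a weighted squared reconstruction error over all cells plus an L2 regularizer, with W_ij = 1 for observed words and W_ij = wm for missing words:

\[
\sum_{i}\sum_{j} W_{ij} \left( P_{\cdot,i} \cdot Q_{\cdot,j} - X_{ij} \right)^{2}
+ \lambda \lVert P \rVert_{2}^{2} + \lambda \lVert Q \rVert_{2}^{2}
\]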

WordNet

Appears in 3 sentences as: WordNet (3)
In Modeling Sentences in the Latent Space
  1. as WordNet (Fellbaum, 1998) (WN).
    Page 2, “Limitations of Topic Models and LSA for Modeling Sentences”
  2. The SS algorithm has access to all the definitions in WordNet (WN).
    Page 5, “Evaluation for SS”
  3. sourceforge.net, WordNet::QueryData
    Page 6, “Experiments and Results”
