Abstract | Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. |
Conclusion | the Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model) that can incorporate translation knowledge in bilingual dictionaries as a regularizer to constrain the parameter estimation so that the learned topic models would be synchronized in multiple languages. |
Introduction | As a robust unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al., 2003b) have recently attracted much attention. |
Introduction | In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages. |
Introduction | PCLSA extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. |
Probabilistic Cross-Lingual Latent Semantic Analysis | In this section, we present our probabilistic cross-lingual latent semantic analysis (PCLSA) model and discuss how it can be used to extract cross-lingual topics from multilingual text data. |
Related Work | Many topic models have been proposed, and the two basic models are the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 1999a) and the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003b). |
A Model of Semantics | Figure 3: The semantics-text correspondence model with K documents sharing the same latent semantic state. |
Abstract | A simple and efficient inference method recursively induces joint semantic representations for each group and discovers correspondence between lexical entries and latent semantic concepts. |
Introduction | We assume that each text in a group is independently generated from a full latent semantic state corresponding to the group. |
Introduction | Unsupervised learning with shared latent semantic representations presents its own challenges, as exact inference requires marginalization over possible assignments of the latent semantic state, consequently, introducing nonlocal statistical dependencies between the decisions about the semantic structure of each text. |
Summary and Future Work | However, exact inference for groups of documents with overlapping semantic representation is generally prohibitively expensive, as the shared latent semantics introduces nonlocal dependences between semantic representations of individual documents. |
A Distributional Model for Argument Classification | Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997), is then applied to M to acquire meaningful representations LSA exploits the linear transformation called Singular Value Decomposition (SVD) and produces an approximation of the original matrix M, capturing (semantic) dependencies between context vectors. |
Empirical Analysis | Clustering, as discussed in Section 3.1, allows to generalize lexical information: similar heads within the latent semantic space are built from the annotated examples and they allow to predict the behavior of new unseen words as found in the test sentences. |
Introduction | Moreover, it generalizes lexical information about the annotated examples by applying a geometrical model, in a Latent Semantic Analysis style, inspired by a distributional paradigm (Pado |