Cross-Lingual Latent Topic Extraction
Zhang, Duo and Mei, Qiaozhu and Zhai, ChengXiang

Article Structure

Abstract

Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way.

Introduction

As a robust unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al., 2003b) have recently attracted much attention.

Related Work

Many topic models have been proposed, and the two basic models are the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 1999a) and the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003b).

Problem Formulation

In general, the problem of cross-lingual topic extraction can be defined as extracting a set of common cross-lingual latent topics covered in text collections in different natural languages.

Probabilistic Cross-Lingual Latent Semantic Analysis

In this section, we present our probabilistic cross-lingual latent semantic analysis (PCLSA) model and discuss how it can be used to extract cross-lingual topics from multilingual text data.

Experiment Design

5.1 Data Set

Experimental Results

6.1 Qualitative Comparison

Conclusion

In this paper, we study the problem of cross-lingual latent topic extraction where the task is to extract a set of common latent topics from multilingual text data.

Acknowledgments

We sincerely thank the anonymous reviewers for their comprehensive and constructive comments.

Topics

cross-lingual

Appears in 37 sentences as: Cross-Lingual (5) cross-lingual (39)
In Cross-Lingual Latent Topic Extraction
  1. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other.
    Page 1, “Abstract”
  2. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Abstract”
  3. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.
    Page 1, “Abstract”
  4. Although many topic models have been proposed and shown to be useful (see Section 2 for more detailed discussion of related work), most of them share a common deficiency: they are designed to work only for monolingual text data and would not work well for extracting cross-lingual latent topics, i.e.
    Page 1, “Introduction”
  5. In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages.
    Page 1, “Introduction”
  6. However, the goals of their work are different from ours in that their models mainly focus on mining cross-lingual topics of matching word pairs and discovering the correspondence at the vocabulary level.
    Page 2, “Introduction”
  7. We use a cross-lingual news data set and a review data set to evaluate PCLSA.
    Page 2, “Introduction”
  8. Experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data, and it outperforms a baseline approach using the standard PLSA on text data in each language.
    Page 2, “Introduction”
  9. (Fung, 1995; Franz et al., 1998; Masuichi et al., 2000; Sadat et al., 2003; Gliozzo and Strapparava, 2006)), but most previous work aims at acquiring word translation knowledge or cross-lingual text categorization from comparable corpora.
    Page 2, “Related Work”
  10. In general, the problem of cross-lingual topic extraction can be defined as extracting a set of common cross-lingual latent topics covered in text collections in different natural languages.
    Page 2, “Problem Formulation”
  11. A cross-lingual latent topic will be represented as a multinomial word distribution over the words in all the languages, i.e.
    Page 2, “Problem Formulation”
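
Sentences 10–11 above define the output representation: each cross-lingual topic is a single multinomial distribution over the union of all languages' vocabularies. A minimal illustrative sketch in Python (the words and probabilities are hypothetical, not taken from the paper's data):

    # One cross-lingual topic: a single multinomial whose support is the
    # union of, e.g., an English and a Chinese vocabulary.
    topic = {
        "economy": 0.30,  # English words
        "market": 0.20,
        "经济": 0.28,      # translations live in the same distribution
        "市场": 0.22,
    }
    assert abs(sum(topic.values()) - 1.0) < 1e-9  # a proper multinomial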

topic models

Appears in 17 sentences as: topic model (7) topic modeling (1) topic models (11)
In Cross-Lingual Latent Topic Extraction
  1. Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way.
    Page 1, “Abstract”
  2. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other.
    Page 1, “Abstract”
  3. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages.
    Page 1, “Abstract”
  4. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Abstract”
  5. As a robust unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al., 2003b) have recently attracted much attention.
    Page 1, “Introduction”
  6. Although many topic models have been proposed and shown to be useful (see Section 2 for more detailed discussion of related work), most of them share a common deficiency: they are designed to work only for monolingual text data and would not work well for extracting cross-lingual latent topics, i.e.
    Page 1, “Introduction”
  7. In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages.
    Page 1, “Introduction”
  8. Both used a bilingual dictionary to bridge the language gap in a topic model.
    Page 2, “Introduction”
  9. Many topic models have been proposed, and the two basic models are the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 1999a) and the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003b).
    Page 2, “Related Work” (PLSA log-likelihood sketched after this list)
  10. They and their extensions have been successfully applied to many problems, including hierarchical topic extraction (Hofmann, 1999b; Blei et al., 2003a; Li and McCallum, 2006), author-topic modeling (Steyvers et al., 2004), contextual topic analysis (Mei and Zhai, 2006), dynamic and correlated topic models (Blei and Lafferty, 2005; Blei and Lafferty, 2006), and opinion analysis (Mei et al., 2007; Branavan et al., 2008).
    Page 2, “Related Work”
  11. Some previous work on multilingual topic models assumes that documents in multiple languages are aligned either at the document level, sentence level, or by time stamps (Mimno et al., 2009; Zhao and Xing, 2006; Kim and Khudanpur, 2004; Ni et al., 2009; Wang et al., 2007).
    Page 2, “Related Work”
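
For reference, the PLSA model in sentence 9 supplies PCLSA's data-fit term. Up to notation, the standard PLSA log-likelihood of a collection C with k topics θ_1, ..., θ_k is

    L(C) = \sum_{d \in C} \sum_{w \in V} c(w, d) \log \sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j)

where c(w, d) is the count of word w in document d. PCLSA keeps this term and adds the dictionary-based regularizer R(C) discussed under "objective function" below.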

Latent Semantic

Appears in 7 sentences as: Latent Semantic (6) latent semantic (2)
In Cross-Lingual Latent Topic Extraction
  1. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Abstract”
  2. As a robust unsupervised way to perform shallow latent semantic analysis of topics in text, probabilistic topic models (Hofmann, 1999a; Blei et al., 2003b) have recently attracted much attention.
    Page 1, “Introduction”
  3. In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages.
    Page 1, “Introduction”
  4. PCLSA extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Introduction”
  5. Many topic models have been proposed, and the two basic models are the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 1999a) and the Latent Dirichlet Allocation (LDA) model (Blei et al., 2003b).
    Page 2, “Related Work”
  6. In this section, we present our probabilistic cross-lingual latent semantic analysis (PCLSA) model and discuss how it can be used to extract cross-lingual topics from multilingual text data.
    Page 3, “Probabilistic Cross-Lingual Latent Semantic Analysis”
  7. the Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model that can incorporate translation knowledge in bilingual dictionaries as a regularizer to constrain the parameter estimation so that the learned topic models would be synchronized in multiple languages.
    Page 8, “Conclusion”

soft constraints

Appears in 5 sentences as: soft constraint (1) soft constraints (3) “soft constraints” (1)
In Cross-Lingual Latent Topic Extraction
  1. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Abstract”
  2. PCLSA extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
    Page 1, “Introduction”
  3. In our model, since we only add a soft constraint on word pairs in the dictionary, their probabilities in common topics are generally different, which naturally captures the different variations of a common topic in different languages.
    Page 2, “Introduction”
  4. incorporating the knowledge of a bilingual dictionary as soft constraints.
    Page 2, “Related Work”
  5. We achieve this by adding such preferences formally to the likelihood function of a probabilistic topic model as “soft constraints” so that when we estimate the model, we would try to not only fit the text data well (which is necessary to extract coherent component topics from each language), but also satisfy our specified preferences (which would ensure the extracted component topics in different languages are semantically related).
    Page 3, “Probabilistic Cross-Lingual Latent Semantic Analysis”
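
Sentence 5 describes folding the dictionary preferences into the likelihood. A plausible sketch of such a soft constraint, using the L(C)/R(C) notation from the "objective function" excerpts below (the paper's exact weighting and normalization may differ), penalizes the per-topic probability gap for every translation pair ⟨u, v⟩ in the dictionary edge set E:

    R(C) = \sum_{\langle u, v \rangle \in E} \sum_{j=1}^{k} \big( p(u \mid \theta_j) - p(v \mid \theta_j) \big)^2

The two terms are then combined, e.g. as a trade-off O(C) = (1 - \lambda) L(C) - \lambda R(C), so that fitting the text and honoring the dictionary are balanced by \lambda rather than enforced as hard constraints.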

objective function

Appears in 4 sentences as: objective function (4)
In Cross-Lingual Latent Topic Extraction
  1. Putting L(C) and R(C) together, we would like to maximize the following objective function, which is a regularized log-likelihood:
    Page 4, “Probabilistic Cross-Lingual Latent Semantic Analysis”
  2. Specifically, we will search for a set of values for all our parameters that can maximize the objective function defined above.
    Page 4, “Probabilistic Cross-Lingual Latent Semantic Analysis”
  3. However, there is no closed-form solution in the M-step for the whole objective function.
    Page 5, “Probabilistic Cross-Lingual Latent Semantic Analysis”
  4. If there is no Ψ_{n+1} such that Q(Ψ_{n+1}; Ψ_n) ≥ Q(Ψ_n; Ψ_n), then we consider Ψ_n to be the local maximum point of the objective function Eq.
    Page 5, “Probabilistic Cross-Lingual Latent Semantic Analysis”
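
Sentences 3–4 describe a generalized EM procedure: because the M-step has no closed-form solution, a candidate Ψ_{n+1} is accepted only if it increases the Q-function, and the iteration stops once no improvement is possible. A minimal sketch in Python, assuming user-supplied e_step, propose_m_step, and q_value routines (hypothetical names, not from the paper):

    def generalized_em(params, e_step, propose_m_step, q_value, max_iter=100):
        """Generalized EM: accept a proposed M-step only if it improves Q."""
        for _ in range(max_iter):
            posteriors = e_step(params)                     # E-step under Psi_n
            candidate = propose_m_step(params, posteriors)  # propose Psi_{n+1}
            # If Q(Psi_{n+1}; Psi_n) does not exceed Q(Psi_n; Psi_n), treat
            # Psi_n as a local maximum of the regularized objective and stop.
            if q_value(candidate, posteriors) <= q_value(params, posteriors):
                break
            params = candidate  # Q improved, so accept the update
        return params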

word pairs

Appears in 4 sentences as: word pair (1) word pairs (3)
In Cross-Lingual Latent Topic Extraction
  1. However, the goals of their work are different from ours in that their models mainly focus on mining cross-lingual topics of matching word pairs and discovering the correspondence at the vocabulary level.
    Page 2, “Introduction”
  2. Therefore, the topics extracted using their model cannot indicate how a common topic is covered differently in the two languages, because the words in each word pair share the same probability in a common topic.
    Page 2, “Introduction”
  3. In our model, since we only add a soft constraint on word pairs in the dictionary, their probabilities in common topics are generally different, which naturally captures the different variations of a common topic in different languages.
    Page 2, “Introduction”
  4. Thus when a cross-lingual topic picks up words that co-occur in monolingual text, it would prefer picking up word pairs whose translations in other languages also co-occur with each other, giving us a coherent multilingual word distribution that characterizes well the content of text in different languages.
    Page 4, “Probabilistic Cross-Lingual Latent Semantic Analysis”
