Building Comparable Corpora Based on Bilingual LDA Model

Comparable corpora can be mined for fine-grained translation equivalents, such as bilingual terminology, named entities and parallel sentences, to support bilingual lexicography, statistical machine translation and cross-language information retrieval (Abdul-Rauf et al., 2009).

2.1 Standard LDA

Based on the bilingual LDA model, building comparable corpora includes several steps to

4.1 Datasets and Evaluation

- Based on Bilingual LDA Model (Page 1, “Introduction”)
- The paper's contributions are: 1) introduce the Bilingual LDA (Latent Dirichlet Allocation) model, which builds comparable corpora and improves the efficiency of matching similar documents; 2) design a novel TFIDF (Topic Frequency-Inverse Document Frequency) method to enhance the distinguishing ability of topics from different documents; 3) propose a tailored (Page 1, “Introduction”)
- 2.1 Standard LDA (Page 2, “Bilingual LDA Model”)
- The LDA model (Blei et al., 2003) represents the latent topics of a document by a Dirichlet distribution over a K-dimensional latent random variable, and becomes a complete generative model when a Dirichlet prior $\beta$ is also imposed (Griffiths et al., 2004) (shown in Fig. (Page 2, “Bilingual LDA Model”)
- Figure 1: Standard LDA model (Page 2, “Bilingual LDA Model”)
- 2.2 Bilingual LDA (Page 2, “Bilingual LDA Model”)
- Bilingual LDA is a bilingual extension of the standard LDA model. (Page 2, “Bilingual LDA Model”)
- Figure 2: Bilingual LDA model (Page 2, “Bilingual LDA Model”)
- Based on the bilingual LDA model, building comparable corpora includes several steps to (Page 2, “Building comparable corpora”)
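
The generative story behind a bilingual LDA model can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: it assumes the common formulation in which an aligned document pair shares one topic distribution while each language keeps its own topic-word distributions. The function names, the concentration value 0.5, the vocabulary sizes and the document lengths are all hypothetical.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw a symmetric Dirichlet sample by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_words(theta, phi, length, rng):
    """For each position, pick a topic from theta, then a word id from that topic."""
    topics = rng.choices(range(len(theta)), weights=theta, k=length)
    return [rng.choices(range(len(phi[k])), weights=phi[k], k=1)[0] for k in topics]

def generate_pair(alpha, phi_src, phi_tgt, len_src, len_tgt, rng):
    """Sample an aligned document pair that shares one topic distribution."""
    # Assumption: theta is drawn once and reused for both languages.
    theta = sample_dirichlet(alpha, len(phi_src), rng)
    doc_src = sample_words(theta, phi_src, len_src, rng)
    doc_tgt = sample_words(theta, phi_tgt, len_tgt, rng)
    return theta, doc_src, doc_tgt

rng = random.Random(0)
K, V_SRC, V_TGT = 3, 8, 10  # hypothetical topic count and vocabulary sizes
# One topic-word distribution per topic and per language.
phi_src = [sample_dirichlet(0.5, V_SRC, rng) for _ in range(K)]
phi_tgt = [sample_dirichlet(0.5, V_TGT, rng) for _ in range(K)]
theta, doc_src, doc_tgt = generate_pair(0.5, phi_src, phi_tgt, 20, 25, rng)
```

Because both documents are generated from the same `theta`, their inferred topic distributions should be close, which is the property a topic-based similarity measure relies on.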

- Preiss (2012) transformed the source-language topic model into the target language and classified the probability distribution of topics in the same language; the shortcoming is that the quality of the model translation seriously limits the quality of the comparable corpora. (Page 1, “Introduction”)
- (2009) adapted the monolingual topic model to a bilingual topic model in which the documents of a concept unit in different languages were assumed to share an identical topic distribution. (Page 1, “Introduction”)
- The bilingual topic model is widely adopted to mine translation equivalents from multi-language documents (Mimno et al., 2009; Ivan et al., 2011) (Page 1, “Introduction”)
- Based on the bilingual topic model, this paper predicts the topical structure of documents in different languages and calculates the similarity of topics over documents to build comparable corpora. (Page 1, “Introduction”)
- generate the bilingual topic model $\varphi$ from the (Page 2, “Building comparable corpora”)

- method of conditional probability to calculate document similarity; 4) address a language-independent study that is not limited to a particular data source in any language. (Page 2, “Introduction”)
- 3.2 Conditional Probability (Page 3, “Building comparable corpora”)
- The similarity between $m_S$ and $m_T$ is defined as the Conditional Probability (CP) of documents (Page 3, “Building comparable corpora”)
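
The excerpt above does not give the CP formula, so the following is only one plausible reading, offered as a sketch. It assumes the similarity is P(m_T | m_S) = sum over k of P(z_k | m_S) * P(m_T | z_k), with P(m_T | z_k) obtained from Bayes' rule as P(z_k | m_T) * P(m_T) / P(z_k). The uniform priors over topics and over candidate documents, the function names and the toy topic distributions are assumptions made for illustration.

```python
def conditional_probability(theta_src, theta_tgt, topic_prior, doc_prior):
    """Sum over topics of P(z|m_S) * P(z|m_T) * P(m_T) / P(z)."""
    return sum(
        s * t * doc_prior / p
        for s, t, p in zip(theta_src, theta_tgt, topic_prior)
    )

def rank_candidates(theta_src, candidates):
    """Rank target documents by their conditional probability given the source."""
    num_topics = len(theta_src)
    topic_prior = [1.0 / num_topics] * num_topics  # assumption: uniform P(Z)
    doc_prior = 1.0 / len(candidates)              # assumption: uniform P(m_T)
    scores = {
        name: conditional_probability(theta_src, theta_tgt, topic_prior, doc_prior)
        for name, theta_tgt in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical topic distributions for one source and two target documents.
theta_src = [0.7, 0.2, 0.1]
candidates = {"doc_a": [0.6, 0.3, 0.1], "doc_b": [0.1, 0.2, 0.7]}
ranking = rank_candidates(theta_src, candidates)
```

Under these assumptions `doc_a` ranks first, since its topic mass is concentrated on the same topic as the source document. With uniform priors the score reduces to a scaled dot product of the two topic distributions, so a non-uniform P(Z) is what would distinguish CP from a plain inner product.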

- denotes the vocabulary probability distribution in topic k; M denotes the number of documents; $\theta_m$ (Page 2, “Bilingual LDA Model”)
- denotes the topic probability distribution in document m; Nm denotes the length of m; me (Page 2, “Bilingual LDA Model”)
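
For orientation, the quantities listed above fit together in the joint distribution of the standard LDA model as usually written after Blei et al. (2003) and Griffiths et al. (2004). The symbols $\alpha$, $\varphi_k$, $z_{m,n}$ and $w_{m,n}$ follow common notation and are assumptions here, since the excerpt does not show them in full.

```latex
p(\mathbf{w}, \mathbf{z}, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{m=1}^{M} p(\theta_m \mid \alpha)
    \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\, p(w_{m,n} \mid \varphi_{z_{m,n}})
```

Here $\varphi_k$ is the vocabulary distribution of topic $k$, $\theta_m$ is the topic distribution of document $m$, and $N_m$ is the length of document $m$.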

- given bilingual corpora, predict the topic distribution $\theta_{m,k}$ of the new documents, calculate the (Page 2, “Building comparable corpora”)
- P(Z), as the prior topic distribution, is assumed a (Page 3, “Building comparable corpora”)

