Abstract | The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. |
Cross-Language Structural Correspondence Learning | The resulting classifier f_ST, which will operate in the cross-lingual setting, is defined as follows: |
Cross-Language Text Classification | One way to overcome this “feature barrier” is to find a cross-lingual representation for documents written in S and T, which enables the transfer of classification knowledge between the two languages. |
Cross-Language Text Classification | Intuitively, one can understand such a cross-lingual representation as a concept space that underlies both languages. |
Cross-Language Text Classification | In the following, we will use θ to denote a map that associates the original |V|-dimensional representation of a document d written in S or T with its cross-lingual representation. |
Experiments | In this study, we are interested in whether the cross-lingual representation induced by CL-SCL captures the difference between positive and negative reviews; by balancing the reviews we ensure that the imbalance does not affect the learned model. |
Experiments | Since SGD is sensitive to feature scaling, the projection θx is post-processed as follows: (1) Each feature of the cross-lingual representation is standardized to zero mean and unit variance, where mean and variance are estimated on D_S ∪ D_u. |
Experiments | (2) The cross-lingual document representations are scaled by a constant α such that |D_S|^{-1} Σ_{x∈D_S} ||αθx|| = 1. |
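The two post-processing steps above can be sketched as follows (a minimal NumPy sketch; the function name, variable names, and the unit-average-norm target are assumptions for illustration, not the paper's code):

```python
import numpy as np

def postprocess(theta_X_s, theta_X_u):
    """Standardize and rescale cross-lingual document representations.

    theta_X_s: projected labeled source documents, shape (n_s, k)
    theta_X_u: projected unlabeled documents, shape (n_u, k)
    """
    # (1) Standardize each feature to zero mean and unit variance,
    #     estimating the statistics on D_S ∪ D_u.
    all_docs = np.vstack([theta_X_s, theta_X_u])
    mean = all_docs.mean(axis=0)
    std = all_docs.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant features
    z_s = (theta_X_s - mean) / std
    z_u = (theta_X_u - mean) / std

    # (2) Rescale by a constant alpha so that the average norm of the
    #     labeled source representations equals 1.
    alpha = 1.0 / np.linalg.norm(z_s, axis=1).mean()
    return alpha * z_s, alpha * z_u
```

After this rescaling, the average Euclidean norm of the labeled source representations is exactly 1, which keeps the SGD step size well-behaved across datasets.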
Introduction | As we will see, a small number of pivots can capture a sufficiently large part of the correspondences between S and T in order to (1) construct a cross-lingual representation and (2) learn a classifier f_ST for the task γ that operates on this representation. |
Introduction | Third, an in-depth analysis with respect to important hyperparameters such as the ratio of labeled and unlabeled documents, the number of pivots, and the optimum dimensionality of the cross-lingual representation. |
Abstract | One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. |
Abstract | Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. |
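One plausible form of such a dictionary-based soft constraint (an illustrative sketch under assumed notation, not necessarily the exact PCLSA objective) is to add to the PLSA log-likelihood a penalty that pulls the per-topic probabilities of each translation pair $(u, v)$ in the bilingual dictionary $\mathcal{D}$ toward each other:

```latex
% Illustrative regularized objective: PLSA log-likelihood plus a
% soft constraint over bilingual dictionary pairs (u, v).
\max_{\Theta}\; \mathcal{L}_{\mathrm{PLSA}}(\Theta)
  \;-\; \lambda \sum_{j=1}^{k} \sum_{(u,v) \in \mathcal{D}}
  \bigl( p(u \mid \theta_j) - p(v \mid \theta_j) \bigr)^{2}
```

Here $\lambda$ controls how strongly translation pairs are encouraged to receive similar probability under each of the $k$ topics.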
Abstract | Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data. |
Introduction | Although many topic models have been proposed and shown to be useful (see Section 2 for more detailed discussion of related work), most of them share a common deficiency: they are designed to work only for monolingual text data and would not work well for extracting cross-lingual latent topics, i.e. |
Introduction | In this paper, we propose a novel topic model, called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) model, which can be used to mine shared latent topics from unaligned text data in different languages. |
Introduction | However, the goals of their work are different from ours in that their models mainly focus on mining cross-lingual topics of matching word pairs and discovering the correspondence at the vocabulary level. |
Problem Formulation | In general, the problem of cross-lingual topic extraction can be defined as extracting a set of common cross-lingual latent topics covered in text collections in different natural languages. |
Problem Formulation | A cross-lingual latent topic will be represented as a multinomial word distribution over the words in all the languages, i.e. |
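Concretely, such a topic is a single multinomial over the union vocabulary of all languages involved (a toy English/Chinese sketch; the words and probabilities are invented for illustration):

```python
# A cross-lingual latent topic as one multinomial distribution over
# words from all languages (toy English/Chinese example).
topic = {
    "economy": 0.30, "market": 0.25, "trade": 0.10,  # English words
    "经济": 0.20, "市场": 0.15,                       # Chinese words
}

# A valid multinomial assigns probabilities that sum to 1
# across the combined vocabulary.
assert abs(sum(topic.values()) - 1.0) < 1e-9
```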
Related Work | (Fung, 1995; Franz et al., 1998; Masuichi et al., 2000; Sadat et al., 2003; Gliozzo and Strapparava, 2006)), but most previous work aims at acquiring word translation knowledge or performing cross-lingual text categorization from comparable corpora. |
Abstract | Our algorithm introduces nonaligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. |
Conclusion | At the heart of our method is the nonaligned signatures (NAS) context similarity score, used for removing incorrect translations using cross-lingual co-occurrences. |
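The core idea, reducing each word to the set of its strongest co-occurring context words (its "signature") and scoring a candidate translation pair by how many signature members are translations of each other, can be sketched as follows (a simplified reconstruction; the function name, arguments, and normalization are assumptions, not the authors' code):

```python
def nas_score(sig_s, sig_t, lexicon, n):
    """Nonaligned-signatures similarity between a source word and a
    candidate translation, given their context signatures.

    sig_s: top-n context words of the source word
    sig_t: top-n context words of the candidate target word
    lexicon: dict mapping source words to sets of target translations
    n: signature size, used for normalization
    """
    # Count source context words with at least one known translation
    # appearing in the target signature; no positional alignment of
    # the two contexts is required.
    hits = sum(1 for w in sig_s if lexicon.get(w, set()) & set(sig_t))
    return hits / n
```

For example, with `lexicon = {"bank": {"banque"}, "money": {"argent"}}`, the call `nas_score(["bank", "money", "river"], ["banque", "eau", "argent"], lexicon, 3)` matches two of the three signature words, illustrating how cross-lingual co-occurrence evidence accumulates without any alignment step.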
Conclusion | It would be interesting to further investigate this observation with other sources of lexicons (e.g., obtained from parallel or comparable corpora) and for other tasks, such as cross-lingual word sense disambiguation and information retrieval. |
Previous Work | 2.3 Cross-lingual Co-occurrences in Lexicon Construction |
Previous Work | Rapp (1999) and Fung (1998) discussed semantic similarity estimation using cross-lingual context vector alignment. |
Previous Work | Using cross-lingual co-occurrences to improve a lexicon generated using a pivot language was suggested by Tanaka and Iwasaki (1996). |