Learning Bilingual Lexicons from Monolingual Corpora
Haghighi, Aria and Liang, Percy and Berg-Kirkpatrick, Taylor and Klein, Dan

Article Structure

Abstract

We present a method for learning bilingual translation lexicons from monolingual corpora.

Introduction

Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994).

Bilingual Lexicon Induction

As input, we are given a monolingual corpus S (a sequence of word tokens) in a source language and a monolingual corpus T in a target language.

Inference

Given our probabilistic model, we would like to maximize the log-likelihood of the observed data.
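
Writing $\theta$ for the model parameters and $\mathbf{m}$ for the latent matching, the objective can be reconstructed from the surrounding definitions as follows (the paper's exact notation may differ slightly):

```latex
\ell(\theta) \;=\; \log p(\mathbf{s}, \mathbf{t}; \theta)
            \;=\; \log \sum_{\mathbf{m}} p(\mathbf{m}, \mathbf{s}, \mathbf{t}; \theta)
```

Summing over all matchings is intractable, so in practice the sum is replaced by a maximization over $\mathbf{m}$, giving a hard-EM-style alternation between choosing a matching and re-estimating parameters.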

Experimental Setup

In section 5, we present developmental experiments in English-Spanish lexicon induction.

Features

In this section, we explore feature representations of word types in our model.

Experiments

In this section we examine how system performance varies when crucial elements are altered.

Analysis

We have presented a novel generative model for bilingual lexicon induction and presented results under a variety of data conditions (section 6.1) and languages (section 6.3) showing that our system can produce accurate lexicons even in highly adverse conditions.

Conclusion

We have presented a generative model for bilingual lexicon induction based on probabilistic CCA.

Topics

language pairs

Appears in 10 sentences as: language pair (1) language pairs (9) languages pairs (1)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.
    Page 1, “Abstract”
  2. Although parallel text is plentiful for some language pairs such as English-Chinese or English-Arabic, it is scarce or even nonexistent for most others, such as English-Hindi or French-Japanese.
    Page 1, “Introduction”
  3. Moreover, parallel text could be scarce for a language pair even if monolingual data is readily available for both languages.
    Page 1, “Introduction”
  4. This task, though clearly more difficult than the standard parallel text approach, can operate on language pairs and in domains where standard approaches cannot.
    Page 1, “Introduction”
  5. This setting has been considered before, most notably in Koehn and Knight (2002) and Fung (1995), but the current paper is the first to use a probabilistic model and present results across a variety of language pairs and data conditions.
    Page 1, “Introduction”
  6. all languages pairs except English-Arabic, we extract evaluation lexicons from the Wiktionary online dictionary.
    Page 5, “Experimental Setup”
  7. While orthographic features are clearly effective for historically related language pairs, they are more limited for other language pairs, where we need to appeal to other clues.
    Page 6, “Features”
  8. (section 6.2), (c) a variety of language pairs (see section 6.3).
    Page 6, “Features”
  9. We also explored how system performance varies for language pairs other than English-Spanish.
    Page 7, “Experiments”
  10. One concern is how our system performs on language pairs where orthographic features are less applicable.
    Page 7, “Experiments”

feature vectors

Appears in 8 sentences as: feature vector (3) feature vectors (5)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. In our method, we represent each language as a monolingual lexicon (see figure 2): a list of word types characterized by monolingual feature vectors, such as context counts, orthographic substrings, and so on (section 5).
    Page 1, “Introduction”
  2. Then, for each matched pair of word types $(i, j) \in \mathbf{m}$, we need to generate the observed feature vectors of the source and target word types, $f_S(s_i) \in \mathbb{R}^{d_S}$ and $f_T(t_j) \in \mathbb{R}^{d_T}$.
    Page 2, “Bilingual Lexicon Induction”
  3. The feature vector of each word type is computed from the appropriate monolingual corpus and summarizes the word’s monolingual characteristics; see section 5 for details and figure 2 for an illustration.
    Page 2, “Bilingual Lexicon Induction”
  4. Specifically, to generate the feature vectors, we first generate a random concept $z_{i,j} \sim \mathcal{N}(0, I_d)$, where $I_d$ is the $d \times d$ identity matrix.
    Page 2, “Bilingual Lexicon Induction”
  5. The source feature vector $f_S(s_i)$ is drawn from a multivariate Gaussian with mean $W_S z_{i,j}$ and covariance $\Psi_S$, where $W_S$ is a $d_S \times d$ matrix which transforms the language-independent concept $z_{i,j}$ into a language-dependent vector in the source space.
    Page 2, “Bilingual Lexicon Induction”
  6. If two word types are truly translations, it will be better to relate their feature vectors through the latent space than to explain them independently via the baseline distribution.
    Page 2, “Bilingual Lexicon Induction”
  7. Since $d_S$ and $d_T$ can be quite large in practice and often greater than $|\mathbf{m}|$, we use Cholesky decomposition to re-represent the feature vectors as $|\mathbf{m}|$-dimensional vectors with the same dot products, which is all that CCA depends on.
    Page 3, “Inference”
  8. For a concrete example of a word type to feature vector mapping, see figure 2.
    Page 5, “Features”
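
Sentences 4 and 5 above specify the full generative story for a matched pair. Below is a minimal NumPy sketch of that process; the dimensions are arbitrary, and $W_S$, $W_T$, $\Psi_S$, $\Psi_T$ are fixed to illustrative values here rather than learned as in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real d_S and d_T come from the feature extractors.
d = 10              # latent concept dimension
d_S, d_T = 50, 40   # source / target feature-space dimensions

# These parameters are learned in the paper; fixed arbitrarily for the sketch.
W_S = rng.normal(size=(d_S, d))   # projects concepts into source feature space
W_T = rng.normal(size=(d_T, d))   # projects concepts into target feature space
Psi_S = np.eye(d_S)               # source noise covariance (assumed identity)
Psi_T = np.eye(d_T)               # target noise covariance (assumed identity)

def generate_matched_pair():
    """Sample the feature vectors of one matched word-type pair (i, j)."""
    z = rng.standard_normal(d)                     # z_{i,j} ~ N(0, I_d)
    f_s = rng.multivariate_normal(W_S @ z, Psi_S)  # f_S(s_i) ~ N(W_S z, Psi_S)
    f_t = rng.multivariate_normal(W_T @ z, Psi_T)  # f_T(t_j) ~ N(W_T z, Psi_T)
    return f_s, f_t

f_s, f_t = generate_matched_pair()   # two correlated feature vectors
```

Because both vectors share the latent concept $z_{i,j}$, true translation pairs end up correlated in a way the baseline (independent) distribution cannot explain, which is exactly the signal sentence 6 describes.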

generative model

Appears in 6 sentences as: Generative Model (1) generative model (5)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings.
    Page 1, “Abstract”
  2. We define a generative model over (1) a source lexicon, (2) a target lexicon, and (3) a matching between them (section 2).
    Page 1, “Introduction”
  3. 2.1 Generative Model
    Page 2, “Bilingual Lexicon Induction”
  4. We propose the following generative model over matchings $\mathbf{m}$ and word types $(\mathbf{s}, \mathbf{t})$, which we call matching canonical correlation analysis (MCCA).
    Page 2, “Bilingual Lexicon Induction”
  5. We have presented a novel generative model for bilingual lexicon induction and presented results under a variety of data conditions (section 6.1) and languages (section 6.3) showing that our system can produce accurate lexicons even in highly adverse conditions.
    Page 8, “Analysis”
  6. We have presented a generative model for bilingual lexicon induction based on probabilistic CCA.
    Page 8, “Conclusion”
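
Inference in this model (see the Inference section above) alternates between choosing a matching and refitting the projection parameters. The sketch below is a simplified stand-in, not the paper's exact procedure: it substitutes scikit-learn's non-probabilistic CCA, scores pairs by dot products in the canonical space, and solves the matching exactly with SciPy's assignment solver, whereas the paper bootstraps from a seed lexicon (see the edit distance topic below) and uses approximations for scale.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cross_decomposition import CCA

def induce_lexicon(F_S, F_T, n_iters=5, n_components=10):
    """Alternate between fitting CCA on matched pairs and rematching.

    F_S: (n_S, d_S) source feature matrix; F_T: (n_T, d_T) target matrix.
    Returns index pairs (i, j) of proposed translations.
    """
    n = min(len(F_S), len(F_T))
    # The paper seeds this procedure (e.g. with an edit-distance lexicon);
    # an arbitrary identity matching is used here purely for illustration.
    rows, cols = np.arange(n), np.arange(n)
    for _ in range(n_iters):
        # "M-step": fit CCA on the currently matched word-type pairs.
        cca = CCA(n_components=n_components)
        cca.fit(F_S[rows], F_T[cols])
        # Project every word type into the shared canonical space.
        Z_S, Z_T = cca.transform(F_S, F_T)
        # "E-step": pick the bipartite matching maximizing total similarity
        # (negated because linear_sum_assignment minimizes cost).
        rows, cols = linear_sum_assignment(-(Z_S @ Z_T.T))
    return list(zip(rows, cols))
```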

edit distance

Appears in 4 sentences as: edit distance (3) edit distances (1)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. The second method is to heuristically induce, where applicable, a seed lexicon using edit distance, as is done in Koehn and Knight (2002).
    Page 5, “Experimental Setup”
  2. Where applicable, we compare against the EDITDIST baseline, which solves a maximum bipartite matching problem where edge weights are normalized edit distances.
    Page 5, “Experimental Setup”
  3. One direct way to capture orthographic similarity between word pairs is edit distance.
    Page 5, “Features”
  4. Note that MCCA can learn regular orthographic correspondences between source and target words, which is something edit distance cannot capture (see table 5).
    Page 6, “Features”
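
Sentence 2 pins down the EDITDIST baseline: normalized edit distances as edge weights in a bipartite matching. A minimal sketch follows; normalizing by the longer word's length is this sketch's assumption, since the quoted text does not spell out the normalizer.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def editdist_matching(src_words, tgt_words):
    """Bipartite matching under normalized edit-distance edge weights."""
    cost = np.array([[edit_distance(s, t) / max(len(s), len(t))
                      for t in tgt_words] for s in src_words])
    rows, cols = linear_sum_assignment(cost)  # minimizes total cost
    return [(src_words[i], tgt_words[j]) for i, j in zip(rows, cols)]

print(editdist_matching(["state", "nation"], ["estado", "nación"]))
```

As sentence 4 notes, this baseline can only reward surface similarity; it cannot learn regular orthographic correspondences between the languages the way MCCA's learned projections can.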

parallel corpora

Appears in 4 sentences as: parallel corpora (4)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. Current statistical machine translation systems use parallel corpora to induce translation correspondences, whether those correspondences be at the level of phrases (Koehn, 2004), treelets (Galley et al., 2006), or simply single words (Brown et al., 1994).
    Page 1, “Introduction”
  2. EN-AR-D: English: 1st 50k sentences of 1994 proceedings of UN parallel corpora; Arabic: 2nd 50k sentences.
    Page 4, “Experimental Setup”
  3. For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
    Page 5, “Experimental Setup”
  4. Our experiments show that high-precision translations can be mined without any access to parallel corpora.
    Page 8, “Conclusion”
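
The extraction rule in sentence 3 is mechanical: keep (s, t) when s aligns to t at least three times and more often than to any other word. A sketch, assuming the aligner's output has been flattened into (source, target) token pairs:

```python
from collections import Counter, defaultdict

def extract_lexicon(aligned_pairs, min_count=3):
    """Turn alignment-model output into an evaluation lexicon.

    aligned_pairs: iterable of (s, t) aligned word tokens.
    (s, t) enters the lexicon if s was aligned to t at least min_count
    times and more often than to any other target word.
    """
    counts = defaultdict(Counter)
    for s, t in aligned_pairs:
        counts[s][t] += 1
    lexicon = set()
    for s, tgt_counts in counts.items():
        ranked = tgt_counts.most_common(2)
        best_t, best_c = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best_c >= min_count and best_c > runner_up:
            lexicon.add((s, best_t))
    return lexicon

# Toy usage: three alignments of "state" to "estado" beat one to "nación".
pairs = [("state", "estado")] * 3 + [("state", "nación")]
print(extract_lexicon(pairs))  # {('state', 'estado')}
```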

feature set

Appears in 3 sentences as: feature set (2) features sets (1)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. As an example of the performance of the system, in English-Spanish induction with our best feature set, using corpora derived from topically similar but nonparallel sources, the system obtains 89.0% precision at 33% recall.
    Page 1, “Introduction”
  2. Table 1: Performance of EDITDIST and our model with various features sets on EN-ES-W. See section 5.
    Page 5, “Experimental Setup”
  3. We will use MCCA (for matching CCA) to denote our model using the optimal feature set (see section 5.3).
    Page 5, “Experimental Setup”
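
The feature sets referred to here combine context counts and orthographic substrings (see the feature vectors topic above). A sketch of both extractors follows; the window size, boundary marker, and substring lengths are illustrative choices, not the paper's tuned settings.

```python
from collections import Counter

def context_features(tokens, word, window=4):
    """Count words co-occurring with `word` inside a fixed-size window."""
    feats = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    feats["ctx=" + tokens[j]] += 1
    return feats

def orthographic_features(word, max_len=3):
    """Indicator counts of character substrings, with boundary markers."""
    padded = "#" + word + "#"
    return Counter(
        "sub=" + padded[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(padded) - n + 1)
    )

print(orthographic_features("time"))  # e.g. sub=#t, sub=tim, sub=e#, ...
```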

parallel sentences

Appears in 3 sentences as: parallel sentences (3)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. Note that although the corpora here are derived from a parallel corpus, there are no parallel sentences.
    Page 4, “Experimental Setup”
  2. These corpora contain no parallel sentences.
    Page 4, “Experimental Setup”
  3. For English-Arabic, we extract a lexicon from 100k parallel sentences of UN parallel corpora by running the HMM intersected alignment model (Liang et al., 2008), adding (s, t) to the lexicon if s was aligned to t at least three times and more than any other word.
    Page 5, “Experimental Setup”

semantically related

Appears in 3 sentences as: semantically related (3)
In Learning Bilingual Lexicons from Monolingual Corpora
  1. airport to aeropuertos), 30 were semantically related (e.g.
    Page 8, “Analysis”
  2. Of the true errors, the most common arose from semantically related words which had strong context feature correlations (see table 4(b)).
    Page 8, “Analysis”
  3. Here, the broad trend is for words which are either translations or semantically related across languages to be close in canonical space.
    Page 8, “Analysis”
