Deciphering Foreign Language by Combining Language Models and Context Vectors
Nuhn, Malte and Mauser, Arne and Ney, Hermann

Article Structure

Abstract

In this paper we show how to train statistical machine translation systems on real-life tasks using only nonparallel monolingual data from two languages.

Introduction

It has long been a vision of science fiction writers and scientists to be able to universally communicate in all languages.

Related Work

Unsupervised training of statistical translation systems without parallel data and related problems have been addressed before.

Translation Model

In this section, we describe the statistical training criterion and the translation model that is trained using monolingual data.

Monolingual Context Similarity

As described in Section 3 we need some mechanism to iteratively choose an active set of translation candidates.

Training Algorithm and Implementation

Given the model presented in Section 3 and the methods illustrated in Section 4, we now describe how to train this model.

Experimental Evaluation

We evaluate our method on three different corpora.

Conclusion

We presented a method for learning statistical machine translation models from nonparallel data.

Topics

LM

Appears in 27 sentences as: LM (29)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. Our FST representation of the LM makes use of failure transitions as described in (Allauzen et al., 2003).
    Page 5, “Training Algorithm and Implementation”
  2. Using a 2-gram LM they obtain 15.3 BLEU, and with a whole-segment LM they achieve 19.3 BLEU.
    Page 6, “Experimental Evaluation”
  3. In comparison to this baseline we run our algorithm with N0 = 50 candidates per source word for both a 2-gram and a 3-gram LM.
    Page 6, “Experimental Evaluation”
  4. Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM.
    Page 6, “Experimental Evaluation”
  5. In case of the 2-gram LM (Figure 3) the translation quality increases until it reaches a plateau after 5 EM+Context cycles.
    Page 6, “Experimental Evaluation”
  6. In case of the 3-gram LM (Figure 4) the statement only holds with respect to TER.
    Page 6, “Experimental Evaluation”
  7. Figure 3: Results on the OPUS corpus with a 2-gram LM, NC = 50, and 30 EM iterations between each context vector step.
    Page 6, “Experimental Evaluation”
  8. The dashed line shows the best result using a 2-gram LM in (Ravi and Knight, 2011).
    Page 6, “Experimental Evaluation”
  9. (Ravi and Knight, 2011) only report results using a 2-gram LM and a whole-segment LM.
    Page 6, “Experimental Evaluation”
  10. Figure 4: Results on the OPUS corpus with a 3-gram LM, NC = 50, and 30 EM iterations between each context vector step.
    Page 7, “Experimental Evaluation”
  11. The dashed line shows the best result using a whole-segment LM in (Ravi and Knight, 2011).
    Page 7, “Experimental Evaluation”
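Item 1 in the list above notes that the language model is represented as a finite-state transducer with failure transitions (Allauzen et al., 2003): failure arcs let the automaton fall back to a lower-order history whenever a higher-order n-gram is unseen, instead of materializing an arc for every possible word. The Python sketch below illustrates the same backoff idea for a plain 2-gram LM scorer. It is a minimal, illustrative example, not the authors' OpenFST implementation; the toy corpus, the add-alpha smoothing, and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter

def train_bigram_lm(sentences, alpha=0.1):
    """Count unigrams and bigrams from tokenized sentences (toy add-alpha smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, set(unigrams), alpha

def logprob(word, history, model):
    """Score log P(word | history): use the bigram estimate if the bigram was seen,
    otherwise fall back to a smoothed unigram estimate -- the role that failure
    transitions play in the FST representation of the LM."""
    unigrams, bigrams, vocab, alpha = model
    if (history, word) in bigrams:
        return math.log(bigrams[(history, word)] / unigrams[history])
    total = sum(unigrams.values())
    return math.log((unigrams[word] + alpha) / (total + alpha * len(vocab)))

def sentence_logprob(sent, model):
    tokens = ["<s>"] + sent + ["</s>"]
    return sum(logprob(w, h, model) for h, w in zip(tokens, tokens[1:]))

# Toy usage: a tiny monolingual corpus and one scored word sequence.
corpus = [["how", "are", "you"], ["how", "are", "you", "today"]]
model = train_bigram_lm(corpus)
print(sentence_logprob(["how", "are", "you", "today"], model))
```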


BLEU

Appears in 16 sentences as: BLEU (21)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
    Page 2, “Related Work”
  2. We show that our method performs better by 1.6 BLEU than the best performing method described in (Ravi and Knight, 2011) while
    Page 5, “Experimental Evaluation”
  3. In case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
    Page 6, “Experimental Evaluation”
  4. For BLEU higher values are better, for TER lower values are better.
    Page 6, “Experimental Evaluation”
  5. Using a 2-gram LM they obtain 15.3 BLEU, and with a whole-segment LM they achieve 19.3 BLEU.
    Page 6, “Experimental Evaluation”
  6. Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM.
    Page 6, “Experimental Evaluation”
  7. (Figure residue: plot legend showing the BLEU curve and the "Full EM best (BLEU)" baseline; axis tick values are not recoverable.)
    Page 6, “Experimental Evaluation”
  8. Our 3-gram based method performs better by 1.6 BLEU than their best system, which is a statistically significant improvement at the 95% confidence level.
    Page 6, “Experimental Evaluation”
  9. (Figure residue: plot legend showing BLEU and TER curves and the "Full EM best (BLEU)" baseline; axis values are not recoverable.)
    Page 7, “Experimental Evaluation”
  10. (Table residue: column headers "Method, CPU, BLEU, TER" and a row labeled "EM, 2-gram LM, 411 cand."; cell values are not recoverable.)
    Page 7, “Experimental Evaluation”
  11. Estimated by running full EM with the 2-gram LM using our implementation for 90 iterations, yielding 15.2 BLEU.
    Page 7, “Experimental Evaluation”
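The quotes above report corpus-level BLEU scores (Papineni et al., 2002). As a reminder of what the numbers measure, here is a small, self-contained Python sketch of a simplified corpus BLEU: clipped (modified) n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference per hypothesis and is an illustration only, not the evaluation script used in the paper; real experiments would use a standard implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precision up to order max_n,
    geometric mean, brevity penalty; one reference per hypothesis."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len, ref_len = hyp_len + len(hyp), ref_len + len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())
    if min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100.0 * brevity * math.exp(log_prec)

# Toy usage (order 2 so a single short sentence does not zero out).
print(corpus_bleu([["the", "cat", "sat", "on", "the", "mat"]],
                  [["the", "cat", "is", "on", "the", "mat"]], max_n=2))
```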


translation model

Appears in 15 sentences as: translation model (7) translation models (6) translation model’s (2)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. In this work, we attempt to learn statistical translation models from only monolingual data in the source and target language.
    Page 1, “Introduction”
  2. This work is a big step towards large-scale and large-vocabulary unsupervised training of statistical translation models.
    Page 1, “Introduction”
  3. In this work, we will develop, describe, and evaluate methods for large vocabulary unsupervised learning of machine translation models suitable for real-world tasks.
    Page 2, “Introduction”
  4. Their best performing approach uses an EM algorithm to train a generative word-based translation model.
    Page 2, “Related Work”
  5. In this section, we describe the statistical training criterion and the translation model that is trained using monolingual data.
    Page 2, “Translation Model”
  6. As training criterion for the translation model’s parameters θ, Ravi and Knight (2011) suggest
    Page 2, “Translation Model”
  7. This becomes increasingly difficult with more complex translation models.
    Page 2, “Translation Model”
  8. translation model that still contains all basic phenomena of a generic translation process.
    Page 3, “Translation Model”
  9. Instead we only allow translation models where for each source word f the number of words e' with P(f|e') ≠ 0 is below some fixed value.
    Page 3, “Translation Model”
  10. We will refer to this value as the maximum number of candidates of the translation model and denote it with N0.
    Page 3, “Translation Model”
  11. The second task can be achieved by running the EM algorithm on the restricted translation model.
    Page 3, “Translation Model”
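Items 9-11 above describe the key restriction that makes unsupervised training tractable: for every source word f only a small active set of target candidates e' with P(f|e') ≠ 0 is kept, and EM is then run on this restricted translation model. The sketch below shows one way such a candidate set could be chosen with monolingual context vectors and cosine similarity, in the spirit of the paper's EM+Context alternation; the count-based vectors, the shared bridge vocabulary used to index both languages, the window size, and the function names are simplifying assumptions for this example, not the authors' exact procedure.

```python
import numpy as np
from collections import defaultdict

def context_vectors(sentences, bridge_vocab, window=2):
    """Count-based context vectors: how often each bridge-vocabulary word occurs
    within +/- window positions of the center word.  Indexing both languages over
    the same bridge vocabulary is a simplification made for this sketch."""
    index = {w: i for i, w in enumerate(bridge_vocab)}
    vectors = defaultdict(lambda: np.zeros(len(bridge_vocab)))
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i and sent[j] in index:
                    vectors[word][index[sent[j]]] += 1.0
    return vectors

def top_candidates(source_vec, target_vectors, n_candidates=50):
    """Keep the n_candidates target words whose context vectors are most similar
    (cosine) to the source word's vector; every other lexicon entry P(f|e') is
    set to zero before the next round of EM iterations."""
    def cosine(a, b):
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / norm) if norm > 0 else 0.0
    ranked = sorted(target_vectors.items(), key=lambda kv: cosine(source_vec, kv[1]), reverse=True)
    return [word for word, _ in ranked[:n_candidates]]

# Usage sketch:
#   f_vecs = context_vectors(source_sentences, bridge_vocab)
#   e_vecs = context_vectors(target_sentences, bridge_vocab)
#   active = {f: top_candidates(vec, e_vecs) for f, vec in f_vecs.items()}
```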


TER

Appears in 9 sentences as: TER (9)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. In case of the OPUS and VERBMOBIL corpora, we evaluate the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) against reference translations.
    Page 6, “Experimental Evaluation”
  2. For BLEU higher values are better, for TER lower values are better.
    Page 6, “Experimental Evaluation”
  3. Figure 3 and Figure 4 show the evolution of BLEU and TER scores for applying our method using a 2-gram and a 3-gram LM.
    Page 6, “Experimental Evaluation”
  4. In case of the 3-gram LM (Figure 4) the statement only holds with respect to TER.
    Page 6, “Experimental Evaluation”
  5. It is notable that during the first iterations TER only improves very little until a large chunk of the language unravels after the third iteration.
    Page 6, “Experimental Evaluation”
  6. (Figure or table residue: the isolated label "TER".)
    Page 6, “Experimental Evaluation”
  7. (Figure residue: plot legend showing BLEU and TER curves and the "Full EM best (BLEU)" baseline; axis values are not recoverable.)
    Page 7, “Experimental Evaluation”
  8. (Table residue: column headers "Method, CPU, BLEU, TER" and a row labeled "EM, 2-gram LM, 411 cand."; cell values are not recoverable.)
    Page 7, “Experimental Evaluation”
  9. (Table residue: column headers "Method, BLEU, TER".)
    Page 7, “Experimental Evaluation”
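TER (Snover et al., 2006) is the minimum number of edits, including block shifts, needed to turn the hypothesis into the reference, normalized by the reference length; lower is better. The sketch below computes only the insertion/deletion/substitution part (word-level Levenshtein distance over the reference length). Because it omits the shift operation, it is an upper bound on TER and serves purely as an illustration, not as the metric implementation used in the paper.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(hyp)][len(ref)]

def simplified_ter(hyp, ref):
    """Edit distance normalized by reference length.  Full TER also allows block
    shifts, so this simplified value can only overestimate the real score."""
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

# Toy usage: two substitutions over a five-word reference -> 0.4.
print(simplified_ter("do you want a room".split(),
                     "would you like a room".split()))
```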


parallel data

Appears in 6 sentences as: parallel data (6)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. Unsupervised training of statistical translation systems without parallel data and related problems have been addressed before.
    Page 2, “Related Work”
  2. Close to the methods described in this work, Ravi and Knight (2011) treat training and translation without parallel data as a deciphering problem.
    Page 2, “Related Work”
  3. They perform experiments on a Spanish-English task with vocabulary sizes of about 500 words and achieve a performance of around 20 BLEU, compared to 70 BLEU obtained by a system that was trained on parallel data.
    Page 2, “Related Work”
  4. They evaluate their method on a variety of tasks, ranging from inherently parallel data (EUROPARL) to unrelated corpora (100k sentences of the GIGAWORD corpus).
    Page 2, “Related Work”
  5. We also compare the results on these corpora to a system trained on parallel data.
    Page 6, “Experimental Evaluation”
  6. Och (2002) reports results of 48.2 BLEU for a single-word based translation system and 56.1 BLEU using the alignment template approach, both trained on parallel data.
    Page 7, “Experimental Evaluation”


language model

Appears in 5 sentences as: language model (4) Language Models (1)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
    Page 1, “Abstract”
  2. Combining Language Models and
    Page 1, “Introduction”
  3. Stochastically generate the target sentence according to an n-gram language model.
    Page 3, “Translation Model”
  4. As described in Section 4, the overall procedure is divided into two alternating steps: After initialization we first perform EM training of the translation model for 20-30 iterations using a 2-gram or 3-gram language model in the target language.
    Page 4, “Training Algorithm and Implementation”
  5. The generative story described in Section 3 is implemented as a cascade of a permutation, insertion, lexicon, deletion and language model finite state transducers using OpenFST (Allauzen et al., 2007).
    Page 5, “Training Algorithm and Implementation”
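Items 3-5 above sketch the generative story behind the model: a target sentence is generated by an n-gram language model and then turned into the observed source sentence by insertion, lexicon, deletion and permutation steps, which the paper implements as a cascade of finite-state transducers in OpenFST (Allauzen et al., 2007). The Python snippet below samples from such a generative story directly, purely as a conceptual illustration; the step probabilities, the restriction of the permutation to adjacent swaps, the toy lexicon and LM, and the function names are assumptions made for this sketch and do not reproduce the authors' FST cascade.

```python
import random

def sample_target(bigram_probs, max_len=12):
    """Sample a target sentence from a 2-gram LM given as
    {history: {word: prob}} with '<s>' / '</s>' boundary tokens."""
    sentence, history = [], "<s>"
    while len(sentence) < max_len:
        words, probs = zip(*bigram_probs[history].items())
        word = random.choices(words, probs)[0]
        if word == "</s>":
            break
        sentence.append(word)
        history = word
    return sentence

def sample_source(target, lexicon, p_del=0.05, p_ins=0.05, p_swap=0.1):
    """Transform the target sentence into a source sentence: per-word lexicon
    substitution P(f|e), occasional deletions and spurious insertions, and
    adjacent swaps as a crude stand-in for the permutation step."""
    source = []
    for e in target:
        if random.random() < p_del:                       # deletion step
            continue
        words, probs = zip(*lexicon[e].items())
        source.append(random.choices(words, probs)[0])    # lexicon step
        if random.random() < p_ins:                       # insertion step
            source.append("<ins>")
    for i in range(len(source) - 1):                      # local permutation
        if random.random() < p_swap:
            source[i], source[i + 1] = source[i + 1], source[i]
    return source

# Toy usage with a hypothetical two-word language model and lexicon.
lm = {"<s>": {"hello": 0.6, "thanks": 0.4},
      "hello": {"</s>": 1.0}, "thanks": {"</s>": 1.0}}
lexicon = {"hello": {"hallo": 0.9, "tag": 0.1}, "thanks": {"danke": 1.0}}
target = sample_target(lm)
print(target, "->", sample_source(target, lexicon))
```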


machine translation

Appears in 4 sentences as: machine translation (4)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. In this paper we show how to train statistical machine translation systems on real-life tasks using only nonparallel monolingual data from two languages.
    Page 1, “Abstract”
  2. In this work, we will develop, describe, and evaluate methods for large vocabulary unsupervised learning of machine translation models suitable for real-world tasks.
    Page 2, “Introduction”
  3. We presented a method for learning statistical machine translation models from nonparallel data.
    Page 8, “Conclusion”
  4. This work serves as a big step towards large-scale unsupervised training for statistical machine translation systems.
    Page 8, “Conclusion”


n-gram

Appears in 4 sentences as: n-gram (4)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
    Page 1, “Abstract”
  2. Stochastically generate the target sentence according to an n-gram language model.
    Page 3, “Translation Model”
  3. being approximately 15 to 20 times faster than their n-gram based approach.
    Page 6, “Experimental Evaluation”
  4. To summarize: Our method is significantly faster than n-gram LM based approaches and obtains better results than any previously published method.
    Page 7, “Experimental Evaluation”


translation systems

Appears in 4 sentences as: translation system (1) translation systems (3)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. In this paper we show how to train statistical machine translation systems on real-life tasks using only nonparallel monolingual data from two languages.
    Page 1, “Abstract”
  2. Unsupervised training of statistical translation systems without parallel data and related problems have been addressed before.
    Page 2, “Related Work”
  3. Och (2002) reports results of 48.2 BLEU for a single-word based translation system and 56.1 BLEU using the alignment template approach, both trained on parallel data.
    Page 7, “Experimental Evaluation”
  4. This work serves as a big step towards large-scale unsupervised training for statistical machine translation systems.
    Page 8, “Conclusion”


statistical machine translation

Appears in 3 sentences as: statistical machine translation (3)
In Deciphering Foreign Language by Combining Language Models and Context Vectors
  1. In this paper we show how to train statistical machine translation systems on real-life tasks using only nonparallel monolingual data from two languages.
    Page 1, “Abstract”
  2. We presented a method for learning statistical machine translation models from nonparallel data.
    Page 8, “Conclusion”
  3. This work serves as a big step towards large-scale unsupervised training for statistical machine translation systems.
    Page 8, “Conclusion”
