Effective Selection of Translation Model Training Data
Le Liu, Yu Hong, Hao Liu, Xing Wang and Jianmin Yao

Article Structure

Abstract

Data selection has been demonstrated to be an effective approach to addressing the lack of high-quality bitext for statistical machine translation in the domain of interest.

Introduction

Statistical machine translation depends heavily on large-scale parallel corpora.

Related Work

Existing data selection methods are mostly based on language models.

Training Data Selection Methods

We present three data selection methods for ranking and selecting domain-relevant sentence pairs from a general-domain corpus, with an eye towards improving domain-specific translation model performance.

Experiments

4.1 Corpora

Conclusion

We present three novel methods for translation model training data selection, which are based on the translation model and language model.

Topics

sentence pairs

Appears in 35 sentences as: sentence pair (9) sentence pairs (31)
In Effective Selection of Translation Model Training Data
  1. Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
    Page 1, “Abstract”
  2. By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by the combination of a language model and a translation model.
    Page 1, “Abstract”
  3. When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points.
    Page 1, “Abstract”
  4. For this, an effective approach is to automatically select and expand domain-specific sentence pairs from a large-scale general-domain parallel corpus.
    Page 1, “Introduction”
  5. Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
    Page 1, “Introduction”
  6. Meanwhile, the translation model measures the translation probability of a sentence pair and is used to verify the parallelism of the selected domain-relevant bitext.
    Page 1, “Introduction”
  7. (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models.
    Page 1, “Related Work”
  8. Although previous work on data selection (Duh et al., 2013; Koehn and Haddow, 2012; Axelrod et al., 2011; Foster et al., 2010; Yasuda et al., 2008) has achieved good performance, methods that only adopt language models to score sentence pairs are suboptimal.
    Page 2, “Related Work”
  9. The reason is that a sentence pair contains a source-language sentence and a target-language sentence, while the existing methods are incapable of evaluating the mutual translation probability of a sentence pair in the target domain.
    Page 2, “Related Work”
  10. We present three data selection methods for ranking and selecting domain-relevant sentence pairs from a general-domain corpus, with an eye towards improving domain-specific translation model performance.
    Page 2, “Training Data Selection Methods”
  11. However, in this paper, we adopt the translation model to evaluate the translation probability of a sentence pair and develop a simple but effective variant of the translation model to rank the sentence pairs in the general-domain corpus.
    Page 2, “Training Data Selection Methods”
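
Taken together, excerpts 1, 2 and 11 describe a simple pipeline: score every general-domain sentence pair with in-domain language and translation models, then keep the top-ranked pairs. A minimal Python sketch of that pipeline, assuming hypothetical lm_score and tm_score callables that stand in for the scoring functions the paper defines in Section 3:

    import heapq

    def select_sentence_pairs(pairs, lm_score, tm_score, k):
        # pairs: iterable of (f, e) source/target pairs from the
        # general-domain corpus. lm_score and tm_score are placeholders
        # for the in-domain language-model and translation-model scorers.
        scored = ((lm_score(f, e) + tm_score(f, e), (f, e)) for f, e in pairs)
        top_k = heapq.nlargest(k, scored, key=lambda item: item[0])
        return [pair for _, pair in top_k]

Using heapq.nlargest avoids sorting the full 16M-pair general-domain corpus when only the top 600k pairs are kept.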

language model

Appears in 31 sentences as: Language Model (1) Language model (1) language model (21) Language Modeling (1) Language Models (1) Language models (1) language models (8)
In Effective Selection of Translation Model Training Data
  1. Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
    Page 1, “Abstract”
  2. By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by the combination of a language model and a translation model.
    Page 1, “Abstract”
  3. Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
    Page 1, “Introduction”
  4. To overcome the problem, we first propose a method combining the translation model with the language model for data selection.
    Page 1, “Introduction”
  5. The language model measures the domain-specific generation probability of sentences and is used to select domain-relevant sentences on both the source and target sides.
    Page 1, “Introduction”
  6. Existing data selection methods are mostly based on language models.
    Page 1, “Related Work”
  7. (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models.
    Page 1, “Related Work”
  8. (2011) improved the perplexity-based approach and proposed bilingual cross-entropy difference as a ranking function with in-domain and general-domain language models.
    Page 1, “Related Work”
  9. 2011) and further explored neural language models for data selection rather than the conventional n-gram language model.
    Page 2, “Related Work”
  10. Although previous work on data selection (Duh et al., 2013; Koehn and Haddow, 2012; Axelrod et al., 2011; Foster et al., 2010; Yasuda et al., 2008) has achieved good performance, methods that only adopt language models to score sentence pairs are suboptimal.
    Page 2, “Related Work”
  11. Thus, we propose novel data selection methods based on both the translation model and the language model.
    Page 2, “Related Work”
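
Excerpt 8 refers to the bilingual cross-entropy difference of Axelrod et al. (2011). In its standard form (reconstructed from that citation, not from this paper's own equations), a sentence pair (f, e) is ranked by

    \[
    \mathrm{score}(f, e) = \bigl[H_{I}^{\mathrm{src}}(f) - H_{G}^{\mathrm{src}}(f)\bigr]
                         + \bigl[H_{I}^{\mathrm{tgt}}(e) - H_{G}^{\mathrm{tgt}}(e)\bigr]
    \]

where H_I and H_G are per-word cross-entropies under the in-domain and general-domain language models on each side; lower scores mark pairs that look in-domain but not general-domain.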

translation model

Appears in 25 sentences as: Translation Model (2) Translation model (2) translation model (21) translation models (3)
In Effective Selection of Translation Model Training Data
  1. By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by the combination of a language model and a translation model.
    Page 1, “Abstract”
  2. In this paper, we study and experiment with novel methods that apply translation models to domain-relevant data selection.
    Page 1, “Abstract”
  3. Such corpora are necessary prior knowledge for training an effective translation model.
    Page 1, “Introduction”
  4. However, domain-specific machine translation has few parallel corpora for translation model training in the domain of interest.
    Page 1, “Introduction”
  5. To overcome the problem, we first propose a method combining the translation model with the language model for data selection.
    Page 1, “Introduction”
  6. Meanwhile, the translation model measures the translation probability of a sentence pair and is used to verify the parallelism of the selected domain-relevant bitext.
    Page 1, “Introduction”
  7. Thus, we propose novel data selection methods based on both the translation model and the language model.
    Page 2, “Related Work”
  8. We present three data selection methods for ranking and selecting domain-relevant sentence pairs from a general-domain corpus, with an eye towards improving domain-specific translation model performance.
    Page 2, “Training Data Selection Methods”
  9. These methods are based on a language model and a translation model, which are trained on small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  10. 3.1 Data Selection with Translation Model
    Page 2, “Training Data Selection Methods”
  11. The translation model is a key component of statistical machine translation.
    Page 2, “Training Data Selection Methods”
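
Excerpts 10 and 11 introduce data selection with the translation model, instantiated as IBM Model 1 (see the “translation probability” topic below). In its standard form the model assigns

    \[
    P(e \mid f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j \mid f_i)
    \]

where l_e and l_f are the sentence lengths, f_0 is the empty (null) source word, and t(e_j|f_i) is the word translation probability. This matches the quantities named in the excerpts, though the paper's exact ranking variant is not reproduced here.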

in-domain

Appears in 22 sentences as: In-domain (5) in-domain (18)
In Effective Selection of Translation Model Training Data
  1. Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
    Page 1, “Abstract”
  2. Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
    Page 1, “Introduction”
  3. (2010) ranked the sentence pairs in the general-domain corpus according to the perplexity scores of sentences, which are computed with respect to in-domain language models.
    Page 1, “Related Work”
  4. These methods are based on a language model and a translation model, which are trained on small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  5. t(e_j|f_i) is the translation probability of word e_j conditioned on word f_i and is estimated from the small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  6. A sentence pair with a higher score is more likely to be generated by the in-domain translation model; thus, it is more relevant to the in-domain corpus and will be retained to expand the training data.
    Page 2, “Training Data Selection Methods”
  7. Where P(e, f) is the joint probability of sentences e and f according to the translation model P(e|f) and the language model P(f), whose parameters are estimated from the small in-domain text.
    Page 2, “Training Data Selection Methods”
  8. A sentence pair with a higher score is more similar to the in-domain corpus and will be picked out.
    Page 2, “Training Data Selection Methods”
  9. The in-domain data is collected from CWMT09, which consists of spoken dialogues in a travel setting, containing approximately 50,000 parallel sentence pairs in English and Chinese.
    Page 3, “Experiments”
  10. Bilingual corpora statistics:
      Corpus            #sentence (Eng / Chn)    #token (Eng / Chn)
      In-domain         50K / 50K                360K / 310K
      General-domain    16M / 16M                3933M / 3602M
    Page 3, “Experiments”
  11. Our work relies on the use of in-domain language models and translation models to rank the sentence pairs from the general-domain bilingual training set.
    Page 3, “Experiments”
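
Excerpt 7 describes the TM+LM score as a joint probability. Written out under that description, a general-domain pair (e, f) is scored by

    \[
    P(e, f) = P(e \mid f)\,P(f)
    \]

with P(e|f) the in-domain IBM Model 1 translation model and P(f) the in-domain language model. The Bidirectional TM+LM variant mentioned elsewhere on this page presumably combines this with the reverse direction P(f|e)P(e); that reading is an assumption, as its formula is not shown in the excerpts.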

translation probability

Appears in 9 sentences as: translation probability (9)
In Effective Selection of Translation Model Training Data
  1. Meanwhile, the translation model measures the translation probability of a sentence pair and is used to verify the parallelism of the selected domain-relevant bitext.
    Page 1, “Introduction”
  2. The reason is that a sentence pair contains a source-language sentence and a target-language sentence, while the existing methods are incapable of evaluating the mutual translation probability of a sentence pair in the target domain.
    Page 2, “Related Work”
  3. However, in this paper, we adopt the translation model to evaluate the translation probability of a sentence pair and develop a simple but effective variant of the translation model to rank the sentence pairs in the general-domain corpus.
    Page 2, “Training Data Selection Methods”
  4. Where P(e|f) is the translation model, IBM Model 1 in this paper; it represents the translation probability of the target-language sentence e conditioned on the source-language sentence f. l_e and l_f are the numbers of words in sentences e and f respectively.
    Page 2, “Training Data Selection Methods”
  5. t(e_j|f_i) is the translation probability of word e_j conditioned on word f_i and is estimated from the small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  6. As described in Section 1, existing data selection methods that only adopt a language model to score sentence pairs are unable to measure the mutual translation probability of sentence pairs.
    Page 2, “Training Data Selection Methods”
  7. However, it does not evaluate the inverse translation probability of the sentence pair or the probability of the target-language sentence.
    Page 2, “Training Data Selection Methods”
  8. Additionally, we adopt GIZA++ to get the word alignment of in-domain parallel data and form the word translation probability table.
    Page 3, “Experiments”
  9. This table will be used to compute the translation probability of general-domain sentence pairs.
    Page 3, “Experiments”
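
Excerpts 4, 5, 8 and 9 together outline how the translation probability of a general-domain pair is computed: a word translation table t(e_j|f_i) is estimated from GIZA++ alignments of the in-domain data and plugged into IBM Model 1. A minimal Python sketch, assuming a t_table dict keyed by (source word, target word) (the table's actual format is not specified in the excerpts) and per-word length normalization (also an assumption):

    import math

    NULL = "<null>"  # IBM Model 1's empty source word f_0

    def model1_log_prob(src_tokens, tgt_tokens, t_table, floor=1e-10):
        # Length-normalized log P(tgt | src) under IBM Model 1.
        # t_table maps (source_word, target_word) -> probability,
        # e.g. estimated from GIZA++ word alignments (assumed format).
        src = [NULL] + src_tokens
        log_p = -len(tgt_tokens) * math.log(len(src))  # the 1/(l_f+1)^{l_e} term
        for e_j in tgt_tokens:
            p = sum(t_table.get((f_i, e_j), 0.0) for f_i in src)
            log_p += math.log(max(p, floor))  # floor guards unseen word pairs
        return log_p / max(len(tgt_tokens), 1)  # normalize by target length

The inverse probability P(src | tgt) used by the bidirectional variant can be obtained by calling the same function with the arguments and table direction swapped.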

BLEU

Appears in 8 sentences as: BLEU (8)
In Effective Selection of Translation Model Training Data
  1. When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points.
    Page 1, “Abstract”
  2. The BLEU scores of the In-domain and General-domain baseline systems are listed in Table 2.
    Page 3, “Experiments”
  3. The results show that the General-domain system, trained on a larger amount of bilingual resources, outperforms the system trained on the in-domain corpus by over 12 BLEU points.
    Page 3, “Experiments”
  4. The horizontal coordinate represents the number of selected sentence pairs, and the vertical coordinate is the BLEU score of the MT systems.
    Page 4, “Experiments”
  5. In the end-to-end SMT evaluation, TM selects only the top 600k sentence pairs of the general-domain corpus, yet increases the translation performance by 2.7 BLEU points.
    Page 4, “Experiments”
  6. Meanwhile, TM+LM and Bidirectional TM+LM gain improvements of 3.66 and 3.56 BLEU points over the General-domain baseline system.
    Page 4, “Experiments”
  7. Compared with the mainstream methods (Ngram and Neural net), our methods increase translation performance by nearly 3 BLEU points when the top 600k sentence pairs are picked out.
    Page 4, “Experiments”
  8. Compared with the methods that only employ a language model for data selection, we observe that our methods are able to select high-quality domain-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points.
    Page 4, “Conclusion”

machine translation

Appears in 8 sentences as: machine translation (8)
In Effective Selection of Translation Model Training Data
  1. Data selection has been demonstrated to be an effective approach to addressing the lack of high-quality bitext for statistical machine translation in the domain of interest.
    Page 1, “Abstract”
  2. Statistical machine translation depends heavily on large-scale parallel corpora.
    Page 1, “Introduction”
  3. However, domain-specific machine translation has few parallel corpora for translation model training in the domain of interest.
    Page 1, “Introduction”
  4. The translation model is a key component of statistical machine translation.
    Page 2, “Training Data Selection Methods”
  5. We use the NiuTrans toolkit, which adopts GIZA++ (Och and Ney, 2003) and MERT (Och, 2003), to train and tune the machine translation system.
    Page 3, “Experiments”
  6. This tool scores the outputs on several criteria; case-insensitive BLEU-4 (Papineni et al., 2002) is used as the evaluation metric for the machine translation system.
    Page 3, “Experiments”
  7. When the top 600k sentence pairs are picked out from the general-domain corpus to train machine translation systems, the systems outperform the General-domain baseline trained on 16 million parallel sentence pairs.
    Page 4, “Experiments”
  8. our methods into the domain adaptation task of statistical machine translation at the model level.
    Page 5, “Conclusion”
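
Excerpt 6 notes that evaluation uses case-insensitive BLEU-4 (Papineni et al., 2002). The paper scores outputs with its own tool; purely as an illustration, the same metric can be approximated with NLTK's corpus_bleu by lowercasing before scoring:

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    def bleu4_case_insensitive(references, hypotheses):
        # references: one reference sentence per hypothesis (strings);
        # hypotheses: system output sentences (strings).
        # Lowercasing both sides makes the metric case-insensitive.
        refs = [[r.lower().split()] for r in references]
        hyps = [h.lower().split() for h in hypotheses]
        return corpus_bleu(refs, hyps,
                           weights=(0.25, 0.25, 0.25, 0.25),  # BLEU-4
                           smoothing_function=SmoothingFunction().method1)

Note this sketch is whitespace-tokenized; the paper's tool and tokenization may differ, so absolute scores are not directly comparable.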

baseline systems

Appears in 6 sentences as: baseline system (2) Baseline Systems (1) baseline systems (3)
In Effective Selection of Translation Model Training Data
  1. 4.4 Baseline Systems
    Page 3, “Experiments”
  2. As described above, by using the NiuTrans toolkit, we have built two baseline systems to fulfill the “863” SLT task in our experiments.
    Page 3, “Experiments”
  3. These two baseline systems are equipped with the same language model, which is trained on a large-scale monolingual target-language corpus.
    Page 3, “Experiments”
  4. The BLEU scores of the In-domain and General-domain baseline systems are listed in Table 2.
    Page 3, “Experiments”
  5. Translation performance of the In-domain and General-domain baseline systems
    Page 3, “Experiments”
  6. Meanwhile, TM+LM and Bidirectional TM+LM gain improvements of 3.66 and 3.56 BLEU points over the General-domain baseline system.
    Page 4, “Experiments”

BLEU points

Appears in 6 sentences as: BLEU point (1) BLEU points (5)
In Effective Selection of Translation Model Training Data
  1. When the selected sentence pairs are evaluated on an end-to-end MT task, our methods can increase the translation performance by 3 BLEU points.
    Page 1, “Abstract”
  2. The results show that the General-domain system, trained on a larger amount of bilingual resources, outperforms the system trained on the in-domain corpus by over 12 BLEU points.
    Page 3, “Experiments”
  3. In the end-to-end SMT evaluation, TM selects only the top 600k sentence pairs of the general-domain corpus, yet increases the translation performance by 2.7 BLEU points.
    Page 4, “Experiments”
  4. Meanwhile, TM+LM and Bidirectional TM+LM gain improvements of 3.66 and 3.56 BLEU points over the General-domain baseline system.
    Page 4, “Experiments”
  5. Compared with the mainstream methods (Ngram and Neural net), our methods increase translation performance by nearly 3 BLEU points when the top 600k sentence pairs are picked out.
    Page 4, “Experiments”
  6. Compared with the methods that only employ a language model for data selection, we observe that our methods are able to select high-quality domain-relevant sentence pairs and improve the translation performance by nearly 3 BLEU points.
    Page 4, “Conclusion”

model training

Appears in 5 sentences as: model training (3) models trained (2)
In Effective Selection of Translation Model Training Data
  1. Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
    Page 1, “Abstract”
  2. However, domain-specific machine translation has few parallel corpora for translation model training in the domain of interest.
    Page 1, “Introduction”
  3. Current data selection methods mostly use language models trained on small-scale in-domain data to measure domain relevance and select domain-relevant parallel sentence pairs to expand training corpora.
    Page 1, “Introduction”
  4. Meanwhile, we use the language model training scripts integrated in the NiuTrans toolkit to train another 4-gram language model, which is used in MT tuning and decoding.
    Page 3, “Experiments”
  5. We present three novel methods for translation model training data selection, which are based on the translation model and language model.
    Page 4, “Conclusion”
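
Excerpt 4 mentions training a 4-gram language model with the scripts bundled in NiuTrans. As a stand-in illustration only (NLTK here, not the NiuTrans scripts), a 4-gram model can be trained and used to compute the sentence perplexities that perplexity-based rankers rely on:

    from nltk.lm import KneserNeyInterpolated
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    def train_4gram_lm(tokenized_sents):
        # tokenized_sents: list of token lists, e.g. [["we", "present"], ...]
        train_data, vocab = padded_everygram_pipeline(4, tokenized_sents)
        lm = KneserNeyInterpolated(4)
        lm.fit(train_data, vocab)
        return lm

    def sentence_perplexity(lm, tokens):
        # Perplexity of one tokenized sentence under the 4-gram model;
        # may return inf for words never seen in training.
        return lm.perplexity(ngrams(pad_both_ends(tokens, n=4), 4))

Lower perplexity with respect to the in-domain model indicates a more domain-relevant sentence.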

parallel data

Appears in 5 sentences as: parallel data (5)
In Effective Selection of Translation Model Training Data
  1. These methods are based on a language model and a translation model, which are trained on small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  2. t(e_j|f_i) is the translation probability of word e_j conditioned on word f_i and is estimated from the small in-domain parallel data.
    Page 2, “Training Data Selection Methods”
  3. Additionally, we adopt GIZA++ to get the word alignment of in-domain parallel data and form the word translation probability table.
    Page 3, “Experiments”
  4. We adopt five methods for extracting domain-relevant parallel data from the general-domain corpus.
    Page 4, “Experiments”
  5. When the top 600k sentence pairs are picked out from the general-domain corpus to train machine translation systems, the systems outperform the General-domain baseline trained on 16 million parallel sentence pairs.
    Page 4, “Experiments”

parallel corpus

Appears in 4 sentences as: parallel corpus (4)
In Effective Selection of Translation Model Training Data
  1. Most current data selection methods solely use language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus.
    Page 1, “Abstract”
  2. For this, an effective approach is to automatically select and expand domain-specific sentence pairs from a large-scale general-domain parallel corpus.
    Page 1, “Introduction”
  3. The reason is that a large-scale parallel corpus contains more bilingual knowledge and language phenomena, while a small in-domain corpus suffers from the data-sparseness problem, which degrades translation performance.
    Page 3, “Experiments”
  4. Results of the systems trained on only a subset of the general-domain parallel corpus.
    Page 4, “Experiments”

statistical machine translation

Appears in 4 sentences as: Statistical machine translation (1) statistical machine translation (3)
In Effective Selection of Translation Model Training Data
  1. Data selection has been demonstrated to be an effective approach to addressing the lack of high-quality bitext for statistical machine translation in the domain of interest.
    Page 1, “Abstract”
  2. Statistical machine translation depends heavily on large-scale parallel corpora.
    Page 1, “Introduction”
  3. The translation model is a key component of statistical machine translation.
    Page 2, “Training Data Selection Methods”
  4. our methods into the domain adaptation task of statistical machine translation at the model level.
    Page 5, “Conclusion”

translation system

Appears in 3 sentences as: translation system (2) translation systems (1)
In Effective Selection of Translation Model Training Data
  1. We use the NiuTrans toolkit, which adopts GIZA++ (Och and Ney, 2003) and MERT (Och, 2003), to train and tune the machine translation system.
    Page 3, “Experiments”
  2. This tool scores the outputs on several criteria; case-insensitive BLEU-4 (Papineni et al., 2002) is used as the evaluation metric for the machine translation system.
    Page 3, “Experiments”
  3. When the top 600k sentence pairs are picked out from the general-domain corpus to train machine translation systems, the systems outperform the General-domain baseline trained on 16 million parallel sentence pairs.
    Page 4, “Experiments”
