A Multi-Domain Translation Model Framework for Statistical Machine Translation
Rico Sennrich, Holger Schwenk and Walid Aransa

Article Structure

Abstract

While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains.

Introduction

The effectiveness of domain adaptation approaches such as mixture-modeling (Foster and Kuhn, 2007) has been established, and has led to research on a wide array of adaptation techniques in SMT, for instance (Matsoukas et al., 2009; Shah et al., 2012).

Related Work

(Ortiz-Martínez et al., 2010) delay the computation of translation model features for the purpose of interactive machine translation with online training.

Translation Model Architecture

This section covers the architecture of the multi-domain translation model framework.

Topics

translation model

Appears in 25 sentences as: Translation Model (1) translation model (23) translation models (3)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time.
    Page 1, “Abstract”
  2. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
    Page 1, “Abstract”
  3. We introduce a translation model architecture that delays the computation of features to the decoding phase.
    Page 1, “Introduction”
  4. (Ortiz-Martínez et al., 2010) delay the computation of translation model features for the purpose of interactive machine translation with online training.
    Page 1, “Related Work”
  5. (Sennrich, 2012b) perform instance weighting of translation models, based on the sufficient statistics.
    Page 2, “Related Work”
  6. (Razmara et al., 2012) describe an ensemble decoding framework which combines several translation models in the decoding step.
    Page 2, “Related Work”
  7. Our work is similar to theirs in that the combination is done at runtime, but we also delay the computation of translation model probabilities, and thus have access to richer sufficient statistics.
    Page 2, “Related Work”
  8. This section covers the architecture of the multi-domain translation model framework.
    Page 2, “Translation Model Architecture”
  9. Our translation model is embedded in a log-linear model as is common for SMT, and treated as a single translation model in this log-linear combination.
    Page 2, “Translation Model Architecture”
  10. The architecture has two goals: move the calculation of translation model features to the decoding phase, and allow for multiple knowledge sources (e.g. bitexts or user-provided data) to contribute to their calculation.
    Page 2, “Translation Model Architecture”
  11. We are concerned with calculating four features during decoding, henceforth just referred to as the translation model features: p(ē|f̄), lex(ē|f̄), p(f̄|ē) and lex(f̄|ē).
    Page 2, “Translation Model Architecture”
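
The four features in item 11 are simple ratios of sufficient statistics, which is what makes it cheap to delay their computation until decoding. Below is a minimal sketch (illustrative, not the authors' implementation; all names are hypothetical) of the two phrase translation features; the lexical weights lex(ē|f̄) and lex(f̄|ē) would be derived analogously from word-level alignment statistics:

```python
# Illustrative sketch: unsmoothed MLE estimation of the two phrase
# translation features from sufficient statistics, evaluated only when a
# phrase pair is actually requested during decoding.

def phrase_features(c_joint: float, c_src: float, c_tgt: float):
    """c_joint = c(ē, f̄), c_src = c(f̄), c_tgt = c(ē)."""
    p_e_given_f = c_joint / c_src if c_src > 0 else 0.0  # p(ē|f̄)
    p_f_given_e = c_joint / c_tgt if c_tgt > 0 else 0.0  # p(f̄|ē)
    return p_e_given_f, p_f_given_e
```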

development set

Appears in 18 sentences as: Development Set (1) development set (15) development sets (4)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. If there is a mismatch between the domain of the development set and the test set, domain adaptation can potentially harm performance compared to an unadapted baseline.
    Page 1, “Introduction”
  2. As a way of optimizing instance weights, (Sennrich, 2012b) minimize translation model perplexity on a set of phrase pairs, automatically extracted from a parallel development set.
    Page 4, “Translation Model Architecture”
  3. Cluster a development set into k clusters.
    Page 4, “Translation Model Architecture”
  4. 4.1 Clustering the Development Set
    Page 4, “Translation Model Architecture”
  5. We use k-means clustering to cluster the sentences of the development set.
    Page 4, “Translation Model Architecture”
  6. As a result of development set clustering, we obtain a bitext for each cluster, which we use to optimize the model weights, and a centroid per cluster.
    Page 4, “Translation Model Architecture”
  7. Only the clustering of the development set and the optimization of the translation model weights for each cluster are affected by k. This means that the approach can in principle be scaled to a high number of clusters, and support a high number of domains.
    Page 5, “Translation Model Architecture”
  8. We perform a linear interpolation of models for each cluster, with interpolation coefficients optimized using perplexity minimization on the development set.
    Page 5, “Translation Model Architecture”
  9. If the development set is labelled, one can also use a gold segmentation of development sets instead of k-means clustering.
    Page 5, “Translation Model Architecture”
  10. The development sets are random samples from the respective in-domain bitexts (held-out from training).
    Page 6, “Translation Model Architecture”
  11. We vary the number of clusters k from 1, which corresponds to adapting the models to the full development set, to 16.
    Page 6, “Translation Model Architecture”
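
A hedged sketch of the clustering step described in items 3–6, assuming each development sentence has already been mapped to an n-dimensional vector (one entropy per component language model; see the "language model" topic below). The variable names and the use of scikit-learn are assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_dev_set(dev_vectors: np.ndarray, k: int):
    """dev_vectors: one n-dimensional entropy vector per dev sentence."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(dev_vectors)
    # km.labels_[i] assigns sentence i to a cluster, yielding one dev bitext
    # per cluster for weight optimization; km.cluster_centers_ are the
    # centroids used at decoding time to select the closest cluster.
    return km.labels_, km.cluster_centers_
```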

weight vector

Appears in 12 sentences as: weight vector (6) Weight vectors (1) weight vectors (5)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. With this framework, adaptation to a new domain simply consists of updating a weight vector, and multiple domains can be supported by the same system.
    Page 1, “Introduction”
  2. For each sentence that is being decoded, we choose the weight vector that is optimized on the closest cluster, allowing for adaptation even with unlabelled and heterogeneous test data.
    Page 1, “Introduction”
  3. To combine statistics from a vector of n component corpora, we can use a weighted version of equation 1, which adds a weight vector λ of length n (Sennrich, 2012b):
    Page 2, “Translation Model Architecture”
  4. Table 1: Illustration of instance weighting with weight vectors for two corpora.
    Page 3, “Translation Model Architecture”
  5. In our implementation, the weight vector is set globally, but can be overridden on a per-sentence basis.
    Page 4, “Translation Model Architecture”
  6. In principle, using different weight vectors for different phrase pairs in a sentence is conceivable.
    Page 4, “Translation Model Architecture”
  7. The framework supports decoding each sentence with a separate weight vector of size 4n, 4 being the number of translation model features whose computation can be weighted, and n the number of model components.
    Page 4, “Translation Model Architecture”
  8. We follow this technique, but want to have multiple weight vectors, adapted to different texts, between which the system switches at decoding time.
    Page 4, “Translation Model Architecture”
  9. Table 5: Weight vectors for feature p(f̄|ē) optimized on four development sets (from gold split and clustering with k = 2).
    Page 6, “Translation Model Architecture”
  10. Table 5 shows the automatically obtained translation model weight vectors for two systems, “gold clusters” and “2 clusters”, for the feature p(f̄|ē).
    Page 7, “Translation Model Architecture”
  11. We have also described a usage scenario for this architecture, namely its ability to quickly switch between weight vectors in order to serve as an adapted model for multiple domains.
    Page 7, “Translation Model Architecture”
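
The weighted combination referenced in item 3 can be made concrete with a small sketch (after Sennrich, 2012b; names are illustrative): each component corpus contributes its raw counts, scaled by its entry in the weight vector λ, before the MLE ratio is taken. Adapting to a new domain then only requires swapping λ:

```python
def weighted_p(joint_counts, marginal_counts, lam):
    """joint_counts[i] = c_i(ē, f̄) and marginal_counts[i] = c_i(f̄) from
    component corpus i; lam[i] is that component's weight. Returns the
    instance-weighted estimate of p(ē|f̄)."""
    num = sum(l * c for l, c in zip(lam, joint_counts))
    den = sum(l * c for l, c in zip(lam, marginal_counts))
    return num / den if den > 0 else 0.0
```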

domain adaptation

Appears in 9 sentences as: domain adaptation (9)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains.
    Page 1, “Abstract”
  2. The effectiveness of domain adaptation approaches such as mixture-modeling (Foster and Kuhn, 2007) has been established, and has led to research on a wide array of adaptation techniques in SMT, for instance (Matsoukas et al., 2009; Shah et al., 2012).
    Page 1, “Introduction”
  3. Therefore, when working with multiple and/or unlabelled domains, domain adaptation is often impractical for a number of reasons.
    Page 1, “Introduction”
  4. Secondly, domain adaptation bears a risk of performance loss.
    Page 1, “Introduction”
  5. If there is a mismatch between the domain of the development set and the test set, domain adaptation can potentially harm performance compared to an unadapted baseline.
    Page 1, “Introduction”
  6. Our immediate purpose for this paper is domain adaptation in a multi-domain environment, but the delay of the feature computation has other potential applications, e.g.
    Page 2, “Translation Model Architecture”
  7. The goal is to perform domain adaptation without requiring domain labels or user input, neither for development nor decoding.
    Page 4, “Translation Model Architecture”
  8. Our theoretical expectation is that domain adaptation will fail to perform well if the test data is from
    Page 4, “Translation Model Architecture”
  9. Secondly, about one third of the IT test set is assigned to a cluster that is not IT-specific, which weakens the effect of domain adaptation for the systems with 16 clusters.
    Page 7, “Translation Model Architecture”

language model

Appears in 9 sentences as: language model (12) language models (2)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. We train a language model on the source language side of each of the n component bitexts, and compute an n-dimensional vector for each sentence by computing its entropy with each language model.
    Page 4, “Translation Model Architecture”
  2. Our aim is not to discriminate between sentences that are more likely and unlikely in general, but to cluster on the basis of relative differences between the language model entropies.
    Page 4, “Translation Model Architecture”
  3. While it is not the focus of this paper, we also evaluate language model adaptation.
    Page 5, “Translation Model Architecture”
  4. The cost of moving language model interpolation into the decoding phase is far greater than for translation models, since the number of hypotheses that need to be evaluated by the language model is several orders of magnitude higher than the number of phrase pairs used during translation.
    Page 5, “Translation Model Architecture”
  5. For the experiments with language model adaptation, we have chosen to perform linear interpolation offline, and perform language model switching during decoding.
    Page 5, “Translation Model Architecture”
  6. While model switching is a fast operation, it also makes the space complexity of storing the language models linear in the number of clusters.
    Page 5, “Translation Model Architecture”
  7. pass decoding, with an unadapted language model in the first phase, and rescoring with a language model adapted online, could perform adequately, and keep the complexity independent of the number of clusters.
    Page 5, “Translation Model Architecture”
  8. For both data sets, language models are trained on the target side of the bitexts.
    Page 6, “Translation Model Architecture”
  9. The fact that language model adaptation yields an additional improvement in our experiments suggests that it would be worthwhile to also investigate a language model data structure that efficiently supports multiple domains.
    Page 7, “Translation Model Architecture”
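
A sketch of the sentence representation described in item 1, assuming KenLM-style models trained on the source side of each component bitext (paths, names, and the number of components are hypothetical). As item 2 notes, only the relative differences between the per-model entropies matter for clustering:

```python
import kenlm  # assumed; any LM exposing per-sentence log-probabilities works

# Hypothetical paths: one source-side LM per component bitext.
models = [kenlm.Model(f"component{i}.arpa") for i in range(4)]

def entropy_vector(sentence: str):
    """Map a sentence to an n-dimensional vector of per-word entropies."""
    n_tokens = len(sentence.split()) + 1  # + 1 for the end-of-sentence token
    # kenlm.Model.score returns a log10 probability; negating and
    # length-normalizing gives a per-word cross-entropy (up to the log base).
    return [-m.score(sentence, bos=True, eos=True) / n_tokens for m in models]
```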

BLEU

Appears in 8 sentences as: BLEU (13)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
    Page 1, “Abstract”
  2. We found that this had no significant effects on BLEU.
    Page 3, “Translation Model Architecture”
  3. We report translation quality using BLEU (Papineni et al., 2002).
    Page 5, “Translation Model Architecture”
  4. For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
    Page 7, “Translation Model Architecture”
  5. Firstly, for the system with adapted TM, one of the three MERT runs is an outlier, and the reported BLEU score of 21.1 is averaged from the three MERT runs achieving 22.1, 21.6, and 19.6 BLEU, respectively.
    Page 7, “Translation Model Architecture”
  6. For the LEGAL domain, the weights are more uniform, which is congruent with our observation that BLEU changes little.
    Page 7, “Translation Model Architecture”
  7. For the system with 16 clusters, we observe an improvement of 0.3 BLEU for TM adaptation, and 0.6 BLEU for adapting both models (34.4 → 34.7 → 35.0).
    Page 7, “Translation Model Architecture”
  8. We observe gains of 0.6 BLEU (34.4 → 35.0) for TM or LM adaptation, and 1 BLEU (34.4 → 35.4) when both models are adapted.
    Page 7, “Translation Model Architecture”

phrase pairs

Appears in 6 sentences as: phrase pair (1) phrase pairs (6)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. For phrase pairs which are not found, c(ē, f̄) and c(f̄) are initially set to 0.
    Page 3, “Translation Model Architecture”
  2. Note that c(ē) is potentially incorrect at this point, since a phrase pair not being found does not entail that c(ē) is 0.
    Page 3, “Translation Model Architecture”
  3. We prune the tables to the most frequent 50 phrase pairs per source phrase before combining them, since calculating the features for all phrase pairs of very common source phrases causes a significant slowdown.
    Page 3, “Translation Model Architecture”
  4. In principle, using different weight vectors for different phrase pairs in a sentence is conceivable.
    Page 4, “Translation Model Architecture”
  5. As a way of optimizing instance weights, (Sennrich, 2012b) minimize translation model perplexity on a set of phrase pairs, automatically extracted from a parallel development set.
    Page 4, “Translation Model Architecture”
  6. The cost of moving language model interpolation into the decoding phase is far greater than for translation models, since the number of hypotheses that need to be evaluated by the language model is several orders of magnitude higher than the number of phrase pairs used during translation.
    Page 5, “Translation Model Architecture”
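
The pruning mentioned in item 3 is straightforward to express. A minimal sketch, assuming each component table maps a source phrase to a dict of target phrase → joint count (a hypothetical layout, not the authors' data structure):

```python
import heapq

def prune_table(table, limit=50):
    """Keep only the `limit` most frequent phrase pairs per source phrase."""
    return {
        src: dict(heapq.nlargest(limit, pairs.items(), key=lambda kv: kv[1]))
        for src, pairs in table.items()
    }
```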

LM

Appears in 4 sentences as: LM (4)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. We find that an adaptation of the TM and LM to the full development set (system “1 cluster”) yields the smallest improvements over the unadapted baseline.
    Page 7, “Translation Model Architecture”
  2. For the IT test set, the system with gold labels and TM adaptation yields an improvement of 0.7 BLEU (21.1 → 21.8), LM adaptation yields 1.3 BLEU (21.1 → 22.4), and adapting both models outperforms the baseline by 2.1 BLEU (21.1 → 23.2).
    Page 7, “Translation Model Architecture”
  3. TM adaptation with 8 clusters (21.1 → 21.8 → 22.1), or LM adaptation with 4 or 8 clusters (21.1 → 22.4 → 23.1).
    Page 7, “Translation Model Architecture”
  4. We observe gains of 0.6 BLEU (34.4 → 35.0) for TM or LM adaptation, and 1 BLEU (34.4 → 35.4) when both models are adapted.
    Page 7, “Translation Model Architecture”

log-linear

Appears in 4 sentences as: Log-linear (1) log-linear (4)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. Our translation model is embedded in a log-linear model as is common for SMT, and treated as a single translation model in this log-linear combination.
    Page 2, “Translation Model Architecture”
  2. Log-linear weights are optimized using MERT (Och and Ney, 2003).
    Page 5, “Translation Model Architecture”
  3. Future work could involve merging our translation model framework with the online adaptation of other models, or the log-linear weights.
    Page 8, “Translation Model Architecture”
  4. Our approach is orthogonal to that of (Clark et al., 2012), who perform feature augmentation to obtain multiple sets of adapted log-linear weights.
    Page 8, “Translation Model Architecture”
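
For context, item 1 refers to the standard SMT log-linear combination (Och and Ney, 2003), in which the four translation model features enter as components of a weighted sum of log feature values. A schematic sketch with hypothetical names:

```python
import math

def loglinear_score(features: dict, weights: dict) -> float:
    """features: feature name -> value; weights: the MERT-tuned weights.
    Zero-valued features are skipped here for simplicity; a real decoder
    would floor them instead."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items() if value > 0)
```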

word alignment

Appears in 4 sentences as: word alignment (3) word alignments (1)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. The lexical weights lex(ē|f̄) and lex(f̄|ē) are calculated as follows, using a set of word alignments a between ē and f̄:
    Page 2, “Translation Model Architecture”
  2. a(s, t) and a(t, s) are not identical since the lexical probabilities are based on the unsymmetrized word alignment frequencies (in the Moses implementation which we re-implement).
    Page 3, “Translation Model Architecture”
  3. In the unweighted variant, the resulting features are equivalent to training on the concatenation of all training data, excepting differences in word alignment, pruning and rounding.
    Page 3, “Translation Model Architecture”
  4. We keep the word alignment and lexical reordering models constant through the experiments to minimize the number of confounding factors.
    Page 5, “Translation Model Architecture”
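
The lexical weights quoted in item 1 follow the standard formulation of Koehn et al. (2003): each target word's probability is averaged over the source words it is aligned to (or scored against NULL if unaligned). A minimal sketch with hypothetical names:

```python
def lex_weight(tgt, src, alignment, w):
    """tgt, src: token lists; alignment: set of (i, j) pairs linking tgt[i]
    to src[j]; w: dict mapping (target word, source word) -> w(t_i|s_j).
    Returns lex(ē|f̄); the reverse direction is computed symmetrically."""
    score = 1.0
    for i, t in enumerate(tgt):
        links = [j for (i2, j) in alignment if i2 == i]
        if links:
            score *= sum(w[(t, src[j])] for j in links) / len(links)
        else:
            score *= w[(t, None)]  # unaligned words pair with NULL
    return score
```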

machine translation

Appears in 3 sentences as: Machine Translation (1) machine translation (2)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. (Ortiz-Martínez et al., 2010) delay the computation of translation model features for the purpose of interactive machine translation with online training.
    Page 1, “Related Work”
  2. One application where this could be desirable is interactive machine translation, where one could work with a mix of compact, static tables, and tables designed to be incrementally trainable.
    Page 3, “Translation Model Architecture”
  3. 2 data sets are out-of-domain, made available by the 2012 Workshop on Statistical Machine Translation (Callison-Burch et al., 2012).
    Page 6, “Translation Model Architecture”

translation probabilities

Appears in 3 sentences as: translation probabilities (3)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. In principle, our architecture can support all mixture operations that (Razmara et al., 2012) describe, plus additional ones such as forms of instance weighting, which are not possible after the translation probabilities have been computed.
    Page 2, “Related Work”
  2. Traditionally, the phrase translation probabilities p(ē|f̄) and p(f̄|ē) are estimated through unsmoothed maximum likelihood estimation (MLE).
    Page 2, “Translation Model Architecture”
  3. The word translation probabilities w(t_i | s_j) are de-
    Page 2, “Translation Model Architecture”

translation systems

Appears in 3 sentences as: translation system (1) translation systems (2)
In A Multi-Domain Translation Model Framework for Statistical Machine Translation
  1. They use separate translation systems for each domain, and a supervised setting, whereas we aim for a system that integrates support for multiple domains, with or without supervision.
    Page 2, “Related Work”
  2. ment a multi-domain translation system.
    Page 8, “Translation Model Architecture”
  3. The translation model framework could also serve as the basis of real-time adaptation of translation systems, e.g. by using incremental means to update the weight vector, or having an incrementally trainable component model that learns from the post-edits by the user, and is assigned a suitable weight.
    Page 8, “Translation Model Architecture”
