Vector Space Model for Adaptation in Statistical Machine Translation
Boxing Chen, Roland Kuhn, and George Foster

Article Structure

Abstract

This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM).

Introduction

The translation models of a statistical machine translation (SMT) system are trained on parallel data.

Vector space model adaptation

Vector space models (VSMs) have been widely applied in many information retrieval and natural language processing applications.

Experiments

3.1 Data setting

Topics

phrase pair

Appears in 20 sentences as: phrase pair (12) phrase pairs (9) phrase pair’s (2)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. This profile might, for instance, be a vector with a dimensionality equal to the number of training subcorpora; each entry in the vector reflects the contribution of a particular subcorpus to all the phrase pairs that can be extracted from the dev set.
    Page 1, “Abstract”
  2. Then, for each phrase pair extracted from the training data, we create a vector with features defined in the same way, and calculate its similarity score with the vector representing the dev set.
    Page 1, “Abstract”
  3. Thus, we obtain a decoding feature whose value represents the phrase pair’s closeness to the dev.
    Page 1, “Abstract”
  4. This is a simple, computationally cheap form of instance weighting for phrase pairs.
    Page 1, “Abstract”
  5. typically use a rich feature set to decide on weights for the training data, at the sentence or phrase pair level.
    Page 2, “Introduction”
  6. As in (Foster et al., 2010), this approach works at the level of phrase pairs.
    Page 2, “Introduction”
  7. Instead of using word-based features and a computationally expensive training procedure, we capture the distributional properties of each phrase pair directly, representing it as a vector in a space which also contains a representation of the dev set.
    Page 2, “Introduction”
  8. The similarity between a given phrase pair’s vector and the dev set vector becomes a feature for the decoder.
    Page 2, “Introduction”
  9. It rewards phrase pairs that are in some sense closer to those found in the dev set, and punishes the rest.
    Page 2, “Introduction”
  10. In the experiments described below, we chose a definition that measures the contribution (to counts of a given phrase pair, or to counts of all phrase pairs in the dev set) of each training subcorpus.
    Page 2, “Introduction”
  11. Then, treating the dev set and each phrase pair as a pair of bags of words (a source bag and a target bag) one could represent each as a vector of dimension S + T, with entries calculated from the counts associated with the S + T clusters (in a way similar to that described for phrase pairs below).
    Page 2, “Introduction”
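The profile-and-similarity computation described in sentences 1–3 above can be illustrated in a few lines. This is a minimal sketch, not the paper's implementation: the function names and counts are invented, and cosine similarity is used as a stand-in, since these snippets do not pin down the exact similarity function used.

```python
import math

def subcorpus_vector(pair_counts):
    """Profile of one phrase pair (or of the whole dev set): one entry
    per training subcorpus, giving that subcorpus's share of the counts.
    pair_counts[i] = count of the phrase pair in subcorpus i."""
    total = sum(pair_counts)
    return [c / total for c in pair_counts] if total else list(pair_counts)

def cosine(u, v):
    """Cosine similarity between two profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Three training subcorpora; all counts are illustrative.
dev_vec = subcorpus_vector([40, 10, 50])    # dev-set profile
pair_vec = subcorpus_vector([8, 1, 11])     # one phrase pair's profile
score = cosine(dev_vec, pair_vec)           # becomes a decoder feature
```

A phrase pair whose counts are distributed across subcorpora the way the dev set's are gets a score near 1; one concentrated in unrelated subcorpora scores lower.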

See all papers in Proc. ACL 2013 that mention phrase pair.

vector space

Appears in 11 sentences as: Vector space (1) vector space (10)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM).
    Page 1, “Abstract”
  2. In this paper, we propose a new instance weighting approach to domain adaptation based on a vector space model (VSM).
    Page 2, “Introduction”
  3. The vector space used by VSM adaptation can be defined in various ways.
    Page 2, “Introduction”
  4. More fundamentally, there is nothing about the VSM idea that obliges us to define the vector space in terms of subcorpora.
    Page 2, “Introduction”
  5. One can think of several other ways of defining the vector space that might yield even better results than those reported here.
    Page 2, “Introduction”
  6. Vector space models (VSMs) have been widely applied in many information retrieval and natural language processing applications.
    Page 2, “Vector space model adaptation”
  7. Therefore, even within the variant of VSM adaptation we focus on in this paper, where the definition of the vector space is based on the existence of subcorpora, one could utilize other definitions of the vectors of the similarity function than those we utilized in our experiments.
    Page 3, “Vector space model adaptation”
  8. In our experiments, we based the vector space on subcorpora defined by the nature of the training data.
    Page 8, “Experiments”
  9. This was done purely out of convenience: there are many, many ways to define a vector space in this situation.
    Page 8, “Experiments”
  10. An obvious and appealing one, which we intend to try in future, is a vector space based on a bag-of-words topic model.
    Page 8, “Experiments”
  11. A feature derived from this topic-related vector space might complement some features derived from the subcorpora which we explored in the experiments above, and which seem to exploit information related to genre and style.
    Page 8, “Experiments”

NIST

Appears in 6 sentences as: NIST (7)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
    Page 1, “Abstract”
  2. Table 1: NIST Chinese-English data.
    Page 4, “Vector space model adaptation”
  3. We carried out experiments in two different settings, both involving data from NIST Open MT 2012. The first setting is based on data from the Chinese to English constrained track, comprising about 283 million English running words.
    Page 4, “Experiments”
  4. The development set (tune) was taken from the NIST 2005 evaluation set, augmented with some web-genre material reserved from other NIST corpora.
    Page 4, “Experiments”
  5. Table 2: NIST Arabic-English data.
    Page 4, “Experiments”
  6. We use the evaluation sets from NIST 2006, 2008, and 2009 as our development set and two test sets, respectively.
    Page 4, “Experiments”

domain adaptation

Appears in 5 sentences as: Domain Adaptation (1) Domain adaptation (1) domain adaptation (3)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM).
    Page 1, “Abstract”
  2. Domain adaptation is an active topic in the natural language processing (NLP) research community.
    Page 1, “Introduction”
  3. The 2012 JHU workshop on Domain Adaptation for MT proposed phrase sense disambiguation (PSD) for translation model adaptation.
    Page 2, “Introduction”
  4. In this paper, we propose a new instance weighting approach to domain adaptation based on a vector space model (VSM).
    Page 2, “Introduction”
  5. Thus, the variant of VSM adaptation tested here bears a superficial resemblance to domain adaptation based on mixture models for TMs, as in (Foster and Kuhn, 2007), in that both approaches rely on information about the subcorpora from which the data originate.
    Page 2, “Introduction”

in-domain

Appears in 5 sentences as: in-domain (5)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. The general idea is first to create a vector profile for the in-domain development (“dev”) set.
    Page 1, “Abstract”
  2. In transductive learning, an MT system trained on general domain data is used to translate in-domain monolingual data.
    Page 1, “Introduction”
  3. Data selection approaches (Zhao et al., 2004; Hildebrand et al., 2005; Lu et al., 2007; Moore and Lewis, 2010; Axelrod et al., 2011) search for bilingual sentence pairs that are similar to the in-domain “dev” data, then add them to the training data.
    Page 1, “Introduction”
  4. For the in-domain dev set, we first run word alignment and phrase extraction in the usual way for the dev set, then sum the distribution of each phrase pair (fj, ek) extracted from the dev data across subcorpora to represent its domain information.
    Page 3, “Vector space model adaptation”
  5. The other is a linear combination of TMs trained on each subcorpus, with the weights of each model learned with an EM algorithm to maximize the likelihood of joint empirical phrase pair counts for in-domain dev data.
    Page 5, “Experiments”
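The EM-trained linear mixture in sentence 5 is standard mixture-weight estimation. A hedged sketch under the usual formulation: `em_mixture_weights`, the toy models, and the counts below are invented for illustration; the paper's actual training runs over real per-subcorpus phrase tables.

```python
def em_mixture_weights(dev_counts, models, iters=50):
    """Learn linear-mixture weights lambda_i maximizing the likelihood of
    the dev set's joint phrase-pair counts under
        p(f, e) = sum_i lambda_i * p_i(f, e).
    dev_counts: {(f, e): count}; models: list of dicts giving p_i[(f, e)]."""
    n = len(models)
    lam = [1.0 / n] * n
    for _ in range(iters):
        expected = [0.0] * n
        for pair, count in dev_counts.items():
            probs = [lam[i] * m.get(pair, 0.0) for i, m in enumerate(models)]
            z = sum(probs)
            if z == 0.0:
                continue
            for i in range(n):
                # E-step: fractional responsibility of model i for this pair
                expected[i] += count * probs[i] / z
        total = sum(expected)
        # M-step: new weights are the normalized expected counts
        lam = [e / total for e in expected]
    return lam
```

A subcorpus whose phrase table explains the dev counts well ends up with a large weight; irrelevant subcorpora are driven toward zero.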

LM

Appears in 5 sentences as: LM (6)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Both were studied in (Foster and Kuhn, 2007), which concluded that the best approach was to combine sub-models of the same type (for instance, several different TMs or several different LMs) linearly, while combining models of different types (for instance, a mixture TM with a mixture LM) log-linearly.
    Page 1, “Introduction”
  2. Other features include lexical weighting in both directions, word count, a distance-based RM, a 4-gram LM trained on the target side of the parallel data, and a 6-gram English Gigaword LM.
    Page 4, “Experiments”
  3. Some of the results reported above involved linear TM mixtures, but none of them involved linear LM mixtures.
    Page 6, “Experiments”
  4. For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.)
    Page 6, “Experiments”
  5. For Arabic, including either form of VSM adaptation always improves performance with significance at p < 0.01, even over a system including both linear TM and linear LM adaptation.
    Page 6, “Experiments”

BLEU

Appears in 4 sentences as: BLEU (6)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation.
    Page 1, “Abstract”
  2. The 3-feature version of VSM yields +1.8 BLEU over the baseline for Arabic to English, and +1.4 BLEU for Chinese to English.
    Page 6, “Experiments”
  3. For instance, with an initial Chinese system that employs linear mixture LM adaptation (lin-lm) and has a BLEU of 32.1, adding 1-feature VSM adaptation (+vsm, joint) improves performance to 33.1 (improvement significant at p < 0.01), while adding 3-feature VSM instead (+vsm, 3 feat.)
    Page 6, “Experiments”
  4. To get an intuition for how VSM adaptation improves BLEU scores, we compared outputs from the baseline and VSM-adapted system (“vsm, joint” in Table 5) on the Chinese test data.
    Page 6, “Experiments”

translation model

Appears in 4 sentences as: translation model (2) translation models (2)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. The translation models of a statistical machine translation (SMT) system are trained on parallel data.
    Page 1, “Introduction”
  2. The 2012 JHU workshop on Domain Adaptation for MT proposed phrase sense disambiguation (PSD) for translation model adaptation.
    Page 2, “Introduction”
  3. The translation model (TM) was smoothed in both directions with KN smoothing (Chen et al., 2011).
    Page 4, “Experiments”
  4. In (Foster and Kuhn, 2007), two kinds of linear mixture were described: linear mixture of language models (LMs), and linear mixture of translation models (TMs).
    Page 6, “Experiments”

log-linear

Appears in 3 sentences as: log-linear (3)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Research on mixture models has considered both linear and log-linear mixtures.
    Page 1, “Introduction”
  2. (Koehn and Schroeder, 2007), instead, opted for combining the sub-models directly in the SMT log-linear framework.
    Page 1, “Introduction”
  3. One is the log-linear combination of TMs trained on each subcorpus (Koehn and Schroeder, 2007), with weights of each model tuned under minimal error rate training using MIRA.
    Page 5, “Experiments”
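The linear vs. log-linear distinction running through these sentences comes down to where the interpolation happens: in probability space or in log space. A hypothetical illustration (the weights and probabilities below are invented):

```python
import math

def linear_mix(probs, weights):
    """Linear mixture: p = sum_i w_i * p_i (weights sum to 1)."""
    return sum(w * p for w, p in zip(weights, probs))

def loglinear_mix(probs, weights):
    """Log-linear combination: score = prod_i p_i ** w_i, i.e. a
    weighted sum in log space (an unnormalized score, not a probability)."""
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, probs)))
```

The behavioral difference: a log-linear combination heavily punishes any sub-model that assigns near-zero probability, while a linear mixture lets one confident sub-model compensate for the others.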

probability distributions

Appears in 3 sentences as: probability distribution (1) probability distributions (2)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Thus, we get the probability distribution of a phrase pair or the phrase pairs in the dev data across all subcorpora:
    Page 3, “Vector space model adaptation”
  2. To further improve the similarity score, we apply absolute discounting smoothing when calculating the probability distributions pi(f, e).
    Page 3, “Vector space model adaptation”
  3. We carry out the same smoothing for the probability distributions pi(dev).
    Page 3, “Vector space model adaptation”
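Sentences 2–3 apply absolute discounting to the per-subcorpus distributions. As a hedged sketch of one common variant of the technique (the function is invented, and the paper's exact choice of discount and backoff-mass assignment is not specified in these snippets):

```python
def discounted_distribution(counts, d=0.5):
    """Absolute discounting over subcorpus counts: subtract a fixed
    discount d from each nonzero count and spread the freed mass
    uniformly over subcorpora with zero count. Requires d smaller than
    the smallest nonzero count."""
    total = sum(counts)
    nonzero = sum(1 for c in counts if c > 0)
    zero = len(counts) - nonzero
    freed = d * nonzero / total          # probability mass freed by discounting
    out = [(c - d) / total if c > 0 else (freed / zero if zero else 0.0)
           for c in counts]
    if zero == 0:
        # no unseen subcorpora: return the freed mass uniformly to all
        out = [p + freed / len(counts) for p in out]
    return out
```

This gives a phrase pair a small nonzero probability in subcorpora where it was never observed, which keeps the similarity computation from being dominated by accidental zeros.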

sentence pairs

Appears in 3 sentences as: sentence pairs (3)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. The resulting bilingual sentence pairs are then used as additional training data (Ueffing et al., 2007; Chen et al., 2008; Schwenk, 2008; Bertoldi and Federico, 2009).
    Page 1, “Introduction”
  2. Data selection approaches (Zhao et al., 2004; Hildebrand et al., 2005; Lu et al., 2007; Moore and Lewis, 2010; Axelrod et al., 2011) search for bilingual sentence pairs that are similar to the in-domain “dev” data, then add them to the training data.
    Page 1, “Introduction”
  3. Most training subcorpora consist of parallel sentence pairs.
    Page 4, “Experiments”

See all papers in Proc. ACL 2013 that mention sentence pairs.

See all papers in Proc. ACL that mention sentence pairs.

Back to top.

similarity score

Appears in 3 sentences as: similarity score (3)
In Vector Space Model for Adaptation in Statistical Machine Translation
  1. Then, for each phrase pair extracted from the training data, we create a vector with features defined in the same way, and calculate its similarity score with the vector representing the dev set.
    Page 1, “Abstract”
  2. VSM uses the similarity score between the vec-
    Page 3, “Vector space model adaptation”
  3. To further improve the similarity score, we apply absolute discounting smoothing when calculating the probability distributions pi(f, e).
    Page 3, “Vector space model adaptation”
