Abstract | This paper proposes a new discriminative training method for constructing phrase and lexicon translation models.
Abstract | Parameters in the phrase and lexicon translation models are estimated by relative frequency or by maximizing joint likelihood, which may not correspond closely to the translation measure, e.g., bilingual evaluation understudy (BLEU) (Papineni et al., 2002).
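The relative-frequency estimation mentioned above can be sketched in a few lines. This is a toy illustration under invented counts, not any paper's implementation: the probability of a target phrase given a source phrase is simply its count divided by the source phrase's total count.

```python
from collections import Counter

# Toy extracted phrase pairs (source, target); in practice these come
# from word-aligned bilingual training data.
phrase_pairs = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("la maison", "the home"),
    ("maison", "house"),
]

pair_counts = Counter(phrase_pairs)
src_counts = Counter(src for src, _ in phrase_pairs)

# Relative-frequency estimate: p(tgt | src) = count(src, tgt) / count(src)
phrase_table = {
    (src, tgt): c / src_counts[src] for (src, tgt), c in pair_counts.items()
}
```

Because the estimate is a plain ratio of counts, rare phrase pairs get unreliable probabilities, which is one motivation for the discriminative training the abstract proposes.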
Abstract | However, the number of parameters in common phrase and lexicon translation models is much larger. |
Introduction | In this work, we attempt to learn statistical translation models from only monolingual data in the source and target language. |
Introduction | This work is a big step towards large-scale and large-vocabulary unsupervised training of statistical translation models.
Introduction | In this work, we will develop, describe, and evaluate methods for large vocabulary unsupervised learning of machine translation models suitable for real-world tasks. |
Related Work | Their best performing approach uses an EM-Algorithm to train a generative word based translation model.
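A generative word-based model of this kind is typically trained with EM in the style of IBM Model 1. The following is a minimal sketch on an invented two-sentence corpus (no NULL word, no smoothing), not the cited authors' code: the E-step collects expected alignment counts, the M-step renormalizes them.

```python
from collections import defaultdict
from itertools import product

# Toy parallel corpus (source sentence, target sentence), tokenized.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
]

src_vocab = {w for s, _ in corpus for w in s}
tgt_vocab = {w for _, t in corpus for w in t}

# Uniform initialization of t(f | e): probability that target word e
# generates source word f.
t = {(f, e): 1.0 / len(src_vocab) for f, e in product(src_vocab, tgt_vocab)}

def em_iteration(t):
    """One EM step of IBM Model 1 (simplified: no NULL word)."""
    count = defaultdict(float)
    total = defaultdict(float)
    for src, tgt in corpus:            # E-step: expected alignment counts
        for f in src:
            z = sum(t[(f, e)] for e in tgt)
            for e in tgt:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    # M-step: renormalize counts into probabilities
    return {(f, e): count[(f, e)] / total[e] for (f, e) in count}

for _ in range(10):
    t = em_iteration(t)
```

On this corpus "das" co-occurs with "the" in both sentences while "haus" co-occurs with "house" only, so after a few iterations the probability mass concentrates on the correct word pairs.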
Translation Model | In this section, we describe the statistical training criterion and the translation model that is trained using monolingual data. |
Translation Model | As training criterion for the translation model’s parameters θ, Ravi and Knight (2011) suggest
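The equation itself appears to have been lost in extraction. A plausible reconstruction of the criterion Ravi and Knight (2011) propose, maximizing the likelihood of the observed monolingual source corpus F while marginalizing over unobserved target-language sentences e under a language-model prior P(e), would be:

```latex
\hat{\theta} = \arg\max_{\theta} \prod_{f \in F} \sum_{e} P(e) \cdot P_{\theta}(f \mid e)
```

The sum over all possible target sentences e is what makes this criterion expensive, consistent with the following remark about more complex translation models.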
Translation Model | This becomes increasingly difficult with more complex translation models.
Abstract | Statistical machine translation is often faced with the problem of combining training data from many diverse sources into a single translation model which then has to translate sentences in a new domain. |
Baselines | Log-linear translation model (TM) mixtures are of the form: |
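The formula itself is missing from this excerpt; a common form of a log-linear TM mixture is p(t | s) ∝ ∏_m p_m(t | s)^{λ_m}, renormalized over target candidates. A hedged Python sketch with invented component tables and weights:

```python
import math

# Two hypothetical component phrase tables p_m(tgt | src) for one source phrase.
tm1 = {"the house": 0.7, "the home": 0.3}
tm2 = {"the house": 0.5, "the home": 0.5}
weights = [0.8, 0.2]  # log-linear mixture weights, one per component model

def loglinear_mix(tables, weights):
    """Combine component models as p(t) ∝ Π_m p_m(t)^λ_m, then renormalize."""
    targets = set().union(*tables)
    unnorm = {
        t: math.exp(sum(w * math.log(tab[t]) for w, tab in zip(weights, tables)))
        for t in targets
    }
    z = sum(unnorm.values())
    return {t: v / z for t, v in unnorm.items()}

mixed = loglinear_mix([tm1, tm2], weights)
```

Unlike a linear mixture, the log-linear combination acts like a weighted intersection: a candidate scored poorly by any component is penalized in proportion to that component's weight.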
Ensemble Decoding | Given a number of translation models which are already trained and tuned, the ensemble decoder uses hypotheses constructed from all of the models in order to translate a sentence. |
Introduction | Common techniques for model adaptation adapt two main components of contemporary state-of-the-art SMT systems: the language model and the translation model.
Introduction | translation model adaptation, because various measures such as perplexity of adapted language models can be easily computed on data in the target domain. |
Introduction | It is also easier to obtain monolingual data in the target domain, compared to bilingual data, which is required for translation model adaptation.
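As a concrete illustration of why language-model adaptation is easier to evaluate: perplexity on target-domain text needs only per-token probabilities from the adapted model. A minimal sketch with invented numbers:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token base-2 log-probabilities of held-out text."""
    return 2 ** (-sum(logprobs) / len(logprobs))

# A model assigning uniform probability 1/8 to each of five tokens
# has perplexity 8 on that text.
pp = perplexity([math.log2(1 / 8)] * 5)
```

No parallel data or decoding is needed, whereas measuring translation model adaptation directly requires bilingual in-domain data.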
Abstract | In this paper, we propose two discriminative, feature-based models to exploit predicate-argument structures for statistical machine translation: 1) a predicate translation model and 2) an argument reordering model. |
Abstract | The predicate translation model explores lexical and semantic contexts surrounding a verbal predicate to select desirable translations for the predicate. |
Introduction | This suggests that conventional lexical and phrasal translation models adopted in those SMT systems are not sufficient to correctly translate predicates in source sentences.
Introduction | Thus we propose a discriminative, feature-based predicate translation model that captures not only lexical information (i.e., surrounding words) but also high-level semantic contexts to correctly translate predicates.
Introduction | In Sections 3 and 4, we elaborate on the proposed predicate translation model and argument reordering model, respectively, including details of modeling, features, and the training procedure.
Predicate Translation Model | In this section, we present the features and the training process of the predicate translation model.
Predicate Translation Model | Following the context-dependent word models in (Berger et al., 1996), we propose a discriminative predicate translation model.
Predicate Translation Model | Given a source sentence which contains N verbal predicates, our predicate translation model Mt can be denoted as
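The model definition is cut off at this point. Given the stated lineage from the context-dependent word models of Berger et al. (1996), a plausible maximum-entropy form, with hypothetical feature functions h_j over a candidate translation e of predicate v_i in context C(v_i), would be:

```latex
M_t = \prod_{i=1}^{N} p\bigl(e_i \mid C(v_i)\bigr), \qquad
p\bigl(e \mid C(v)\bigr) =
\frac{\exp\bigl(\sum_j \lambda_j h_j(e, C(v))\bigr)}
     {\sum_{e'} \exp\bigl(\sum_j \lambda_j h_j(e', C(v))\bigr)}
```

This is a reconstruction for illustration only; the paper's exact factorization and feature set are not shown in this excerpt.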
Related Work | Our predicate translation model is also related to previous discriminative lexicon translation models (Berger et al., 1996; Venkatapathy and Bangalore, 2007; Mauser et al., 2009).
Related Work | This will tremendously reduce the amount of training data required, which is usually a problem in discriminative lexicon translation models (Mauser et al., 2009).
Related Work | Furthermore, the proposed translation model also differs from previous lexicon translation models in that we use both lexical and semantic features.
A Class-based Model of Agreement | Translation model notation: e, target sequence of I words; f, source sequence of J words; a, sequence of K phrase alignments for (e, f); π, permutation of the alignments for target word order; h, sequence of M feature functions; λ, sequence of learned weights for the M features; H, a priority queue of hypotheses.
Discussion of Translation Results | This large gap between the unigram recall of the actual translation output (top) and the lexical coverage of the phrase-based model (bottom) indicates that translation performance can be improved dramatically by altering the translation model through features such as ours, without expanding the search space of the decoder. |
Experiments | We trained the translation model on 502 million words of parallel text collected from a variety of sources, including the Web. |
Inference during Translation Decoding | 3.3 Translation Model Features |
Introduction | However, using lexical coverage experiments, we show that there is ample room for translation quality improvements through better selection of forms that already exist in the translation model.
Related Work | Factored Translation Models: Factored translation models (Koehn and Hoang, 2007) facilitate a more data-oriented approach to agreement modeling.
Related Work | Subotin (2011) recently extended factored translation models to hierarchical phrase-based translation and developed a discriminative model for predicting target-side morphology in English-Czech. |
Conclusion | We present a head-driven hierarchical phrase-based (HD-HPB) translation model, which adopts head information (derived through unlabeled dependency analysis) in the definition of non-terminals to better differentiate among translation rules.
Head-Driven HPB Translation Model | Like Chiang (2005) and Chiang (2007), our HD-HPB translation model adopts a synchronous context-free grammar, a rewriting system which generates source- and target-side string pairs simultaneously.
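A single SCFG rewriting step can be illustrated with Chiang's well-known rule X → ⟨X1 de X2, X2 of X1⟩, which swaps the two sub-phrases across languages. The sketch below is a toy illustration with invented fillers, not the HD-HPB system's data structures:

```python
# A toy synchronous CFG rule in the style of Chiang's HPB grammar:
#   X -> < X1 de X2 , X2 of X1 >
# The shared labels X1/X2 link nonterminal slots on the two sides.
rule = (["X1", "de", "X2"], ["X2", "of", "X1"])

def apply_rule(rule, fillers):
    """Substitute (src, tgt) string pairs for the linked nonterminals,
    producing a source string and its reordered target string together."""
    src_side, tgt_side = rule
    src = [fillers[tok][0] if tok in fillers else tok for tok in src_side]
    tgt = [fillers[tok][1] if tok in fillers else tok for tok in tgt_side]
    return " ".join(src), " ".join(tgt)

fillers = {"X1": ("Aozhou", "Australia"),
           "X2": ("shaoshu guojia", "few countries")}
src, tgt = apply_rule(rule, fillers)
```

Because both sides are generated in one step, reordering (here, the swap of X1 and X2) is encoded directly in the grammar rather than handled by a separate distortion model.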
Head-Driven HPB Translation Model | For rule extraction, we first identify initial phrase pairs on word-aligned sentence pairs by using the same criterion as most phrase-based translation models (Och and Ney, 2004) and Chiang’s HPB model (Chiang, 2005; Chiang, 2007). |
Head-Driven HPB Translation Model | By merging two neighboring non-terminals into a single non-terminal, NRRs enable the translation model to explore a wider search space.
Introduction | Chiang’s hierarchical phrase-based (HPB) translation model utilizes a synchronous context-free grammar (SCFG) for translation derivation (Chiang, 2005; Chiang, 2007) and has been widely adopted in statistical machine translation (SMT).
Introduction | However, the two approaches are not mutually exclusive, as we could also include a set of syntax-driven features into our translation model . |
Abstract | Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems.
Abstract | In this paper, we propose a simple extension to the IBM models: an ℓ0 prior to encourage sparsity in the word-to-word translation model.
Conclusion | We have extended the IBM models and HMM model by the addition of an ℓ0 prior to the word-to-word translation model, which compacts the word-to-word translation table, reducing overfitting and, in particular, the “garbage collection” effect.
Experiments | Table 4 shows BLEU scores for translation models learned from these alignments.
Introduction | Although state-of-the-art translation models use rules that operate on units bigger than words (like phrases or tree fragments), they nearly always use word alignments to drive extraction of those translation rules.
Introduction | It extends the IBM/HMM models by incorporating an ℓ0 prior, inspired by the principle of minimum description length (Barron et al., 1998), to encourage sparsity in the word-to-word translation model (Section 2.2).
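The ℓ0 norm (the count of nonzero parameters) is not differentiable, so in practice a smooth surrogate is optimized. The exact surrogate used by the paper is not shown in this excerpt; the sketch below uses one common choice for nonnegative parameters, with invented probability vectors:

```python
import math

def smoothed_l0(theta, beta=0.05):
    """Smooth surrogate for ||theta||_0: sum_i (1 - exp(-x_i / beta)).
    For nonnegative x (e.g. translation probabilities) this approaches
    the exact nonzero count as beta -> 0."""
    return sum(1.0 - math.exp(-x / beta) for x in theta)

# Two translation-probability vectors with the same l1 mass: the sparse
# one incurs a much smaller (smoothed) l0 penalty.
sparse = [0.97, 0.01, 0.01, 0.01, 0.0]
diffuse = [0.2, 0.2, 0.2, 0.2, 0.2]
```

Subtracting a weighted penalty of this form from the likelihood pushes mass onto a few translations per word, which is exactly the compaction of the translation table described in the conclusion above.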
Conclusion and Future Work | Finally, we hope to apply our method to other translation models, especially syntax-based models.
Decoding | In the topic-specific lexicon translation model, given a source document, it first calculates the topic-specific translation probability by normalizing the entire lexicon translation table, and then adapts the lexical weights of rules correspondingly.
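One way to realize such a document-specific lexicon is to mix per-topic tables by the document's topic distribution and renormalize per source word. This is a hedged sketch: the per-topic tables, the topic posteriors, and the linear mixing are invented for illustration and do not reproduce any particular paper's exact formula.

```python
# Hypothetical per-topic lexicon tables p(tgt | src, topic) and a
# document-level topic distribution p(topic | doc) from a topic model.
topic_tables = [
    {("bank", "Ufer"): 0.8, ("bank", "Bank"): 0.2},   # topic 0: rivers
    {("bank", "Ufer"): 0.1, ("bank", "Bank"): 0.9},   # topic 1: finance
]
doc_topics = [0.25, 0.75]

def adapt_lexicon(topic_tables, doc_topics):
    """Mix per-topic lexicons by the document's topic distribution,
    then renormalize per source word to get a document-specific lexicon."""
    mixed = {}
    for table, w in zip(topic_tables, doc_topics):
        for pair, p in table.items():
            mixed[pair] = mixed.get(pair, 0.0) + w * p
    totals = {}
    for (src, _), p in mixed.items():
        totals[src] = totals.get(src, 0.0) + p
    return {pair: p / totals[pair[0]] for pair, p in mixed.items()}

adapted = adapt_lexicon(topic_tables, doc_topics)
```

For a finance-heavy document the adapted lexicon shifts mass toward the finance sense, which is then fed into the rules' lexical weights during decoding.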
Experiments | The adapted lexicon translation model is added as a new feature under the discriminative framework. |
Introduction | To exploit topic information for statistical machine translation (SMT), researchers have proposed various topic-specific lexicon translation models (Zhao and Xing, 2006; Zhao and Xing, 2007; Tam et al., 2007) to improve translation quality. |
Introduction | Topic-specific lexicon translation models focus on word-level translations. |
Related Work | combine a specific-domain translation model with a general-domain translation model depending on various text distances.
Abstract | We use these topic distributions to compute topic-dependent lexical weighting probabilities and directly incorporate them into our translation model as features. |
Discussion and Conclusion | We can construct a topic model once on the training data, and use it to infer topics on any test set to adapt the translation model.
Introduction | This problem has led to a substantial amount of recent work in trying to bias, or adapt, the translation model (TM) toward particular domains of interest (Axelrod et al., 2011; Foster et al., 2010; Snover et al., 2008). The intuition behind TM adaptation is to increase the likelihood of selecting relevant phrases for translation.
Introduction | We induce unsupervised domains from large corpora, and we incorporate soft, probabilistic domain membership into a translation model . |
Introduction | We accomplish this by introducing topic dependent lexical probabilities directly as features in the translation model, and interpolating them log-linearly with our other features, thus allowing us to discriminatively optimize their weights on an arbitrary objective function.
Introduction | The translation quality of an SMT system is closely related to the coverage of its translation models.
Introduction | Naturally, a solution to the coverage problem is to bridge the gaps between the input sentences and the translation models, either from the input side, which aims to rewrite the input sentences into MT-favored expressions, or from the side of the translation models, which aims to enrich the translation models to cover more expressions.