Topological Ordering of Function Words in Hierarchical Phrase-based Translation
Setiawan, Hendra and Kan, Min-Yen and Li, Haizhou and Resnik, Philip

Article Structure

Abstract

Hierarchical phrase-based models are attractive because they provide a consistent framework within which to characterize both local and long-distance reorderings, but they also make it difficult to distinguish many implausible reorderings from those that are linguistically plausible.

Introduction

Hierarchical phrase-based models (Chiang, 2005; Chiang, 2007) offer a number of attractive benefits in statistical machine translation (SMT), while maintaining the strengths of phrase-based systems (Koehn et al., 2003).

Hierarchical Phrase-based System

Formally, a hierarchical phrase-based SMT system is based on a weighted synchronous context free grammar (SCFG) with one type of nonterminal symbol.
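
To make the grammar formalism concrete, here is a minimal sketch in Python (not the authors' code; the data structures are invented for illustration, with Chiang's (2005) well-known yu/you rule as the example):

    from dataclasses import dataclass

    @dataclass
    class Rule:
        # Source and target sides mix terminal strings with integers;
        # the integers co-index occurrences of the single nonterminal X.
        source: tuple
        target: tuple

    # Chiang's (2005) example: X -> <yu X1 you X2, have X2 with X1>
    r = Rule(source=("yu", 1, "you", 2), target=("have", 2, "with", 1))

    def expand(rule, subtranslations):
        # Substitute already-derived subtranslations into the target side.
        return " ".join(subtranslations[s] if isinstance(s, int) else s
                        for s in rule.target)

    print(expand(r, {1: "France", 2: "diplomatic relations"}))
    # -> "have diplomatic relations with France"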

Overgeneration and Topological Ordering of Function Words

The use of only one type of nonterminal allows a flexible permutation of the topological ordering of the same set of rules, resulting in a huge number of possible derivations from a given source sentence.
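
A toy way to quantify this flexibility (a hypothetical sketch, not the paper's analysis): with a single generic nonterminal, monotone and inverted compositions of the same subphrases are equally licensed, so the number of reachable reorderings grows rapidly with the number of subphrases (the counts below are the Schröder numbers).

    from functools import lru_cache

    def derivable_orders(n):
        """Permutations of n adjacent subphrases reachable by freely
        nesting monotone and inverted binary compositions of spans."""
        @lru_cache(maxsize=None)
        def perms(lo, hi):
            if hi - lo == 1:
                return frozenset({(lo,)})
            out = set()
            for mid in range(lo + 1, hi):
                for a in perms(lo, mid):
                    for b in perms(mid, hi):
                        out.add(a + b)  # monotone composition
                        out.add(b + a)  # inverted composition
            return frozenset(out)
        return perms(0, n)

    print([len(derivable_orders(n)) for n in range(1, 6)])  # [1, 2, 6, 22, 90]

Nothing in the grammar itself distinguishes the linguistically plausible orders among these; that gap is what the dominance model targets.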

Pairwise Dominance Model

Our example suggests that we may be able to improve the translation model’s sensitivity to correct versus incorrect reordering choices by modeling the topological ordering of function words.
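
As a hedged sketch of the idea (the paper derives dominance values from word alignments; the four-way orientation inventory and every probability below are illustrative assumptions, not estimates from the paper's data), each pair of neighboring function words would contribute a log-probability soft constraint to the log-linear model:

    import math

    # Hypothetical dominance distributions P(d | y, y') for neighboring
    # function words y and y'; all numbers here are invented.
    DOM = {("yu", "de"): {"leftFirst": 0.6, "rightFirst": 0.2,
                          "dontCare": 0.15, "neither": 0.05}}

    def dominance_score(observed_pairs, floor=1e-6):
        """Sum of log P(d | y, y'), used as one feature alongside the
        usual translation, lexical, and language model features."""
        total = 0.0
        for y, y2, d in observed_pairs:
            total += math.log(DOM.get((y, y2), {}).get(d, floor))
        return total

    print(dominance_score([("yu", "de", "leftFirst")]))  # log 0.6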

Parameter Estimation

Learning the dominance model involves extracting d values for every pair of neighboring function words in the training bitext.
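
A minimal sketch of the counting this implies, assuming word-aligned sentence pairs; the orient callback, standing in for the paper's span-based dominance decision, is hypothetical:

    from collections import defaultdict

    def estimate_dominance(aligned_bitext, function_words, orient):
        """Relative-frequency estimates of P(d | y, y') from a bitext.
        orient(src, tgt, align, i, j) must return the dominance value d
        for the neighboring function words at source positions i and j."""
        counts = defaultdict(lambda: defaultdict(int))
        for src, tgt, align in aligned_bitext:
            pos = [i for i, w in enumerate(src) if w in function_words]
            for i, j in zip(pos, pos[1:]):
                counts[(src[i], src[j])][orient(src, tgt, align, i, j)] += 1
        return {pair: {d: c / sum(dist.values()) for d, c in dist.items()}
                for pair, dist in counts.items()}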

Experimental Setup

We tested the effect of introducing the pairwise dominance model into hierarchical phrase-based translation on Chinese-to-English and Arabic-to-English translation tasks, thus studying its effect in two languages where the use of function words differs significantly.

Experimental Results

Chinese-to-English experiments.

Discussion and Future Work

The results in both sets of experiments show consistently that we have achieved significant gains by modeling the topological ordering of function words.

Related Work

In the introduction, we discussed Chiang’s (2005) constituency feature, related ideas explored by Marton and Resnik (2008) and Chiang et al. (2008).

Conclusion

We have presented a pairwise dominance model to address reordering issues that are not handled particularly well by standard hierarchical phrase-based modeling.

Topics

phrase-based

Appears in 19 sentences as: phrase-based (21)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. Hierarchical phrase-based models are attractive because they provide a consistent framework within which to characterize both local and long-distance reorderings, but they also make it difficult to distinguish many implausible reorderings from those that are linguistically plausible.
    Page 1, “Abstract”
  2. Rather than appealing to annotation-driven syntactic modeling, we address this problem by observing the influential role of function words in determining syntactic structure, and introducing soft constraints on function word relationships as part of a standard log-linear hierarchical phrase-based model.
    Page 1, “Abstract”
  3. Hierarchical phrase-based models (Chiang, 2005; Chiang, 2007) offer a number of attractive benefits in statistical machine translation (SMT), while maintaining the strengths of phrase-based systems (Koehn et al., 2003).
    Page 1, “Introduction”
  4. To model such a reordering, a hierarchical phrase-based system demands no additional parameters, since long and short distance reorderings are modeled identically using synchronous context free grammar (SCFG) rules.
    Page 1, “Introduction”
  5. Interestingly, hierarchical phrase-based models provide this benefit without making any linguistic commitments beyond the structure of the model.
    Page 1, “Introduction”
  6. In this paper, we pursue a different approach to improving reordering choices in a hierarchical phrase-based model.
    Page 1, “Introduction”
  7. In Section 2, we briefly review hierarchical phrase-based models.
    Page 2, “Introduction”
  8. Formally, a hierarchical phrase-based SMT system is based on a weighted synchronous context free grammar (SCFG) with one type of nonterminal symbol.
    Page 2, “Hierarchical Phrase-based System”
  9. Synchronous rules in hierarchical phrase-based models take the following form: X → ⟨γ, α, ∼⟩, where γ and α are strings of terminals and nonterminals and ∼ co-indexes the nonterminal occurrences across the two sides.
    Page 2, “Hierarchical Phrase-based System”
  10. Translation of a source sentence e using hierarchical phrase-based models is formulated as a search for the most probable derivation D* whose source side is equal to e: D* = argmax_{D : source(D) = e} P(D).
    Page 2, “Hierarchical Phrase-based System” (a toy decoding sketch follows this list)
  11. The problem may be less severe in hierarchical phrase-based MT than in BTG, since lexical items on the rules’ right hand sides often limit the span of nonterminals.
    Page 3, “Overgeneration and Topological Ordering of Function Words”
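
To connect sentences 8-11, here is a toy best-derivation search over a single-nonterminal grammar (all phrases and log-weights are invented; this is a sketch, not the paper's decoder). With only generic composition weights, the search has no reason to prefer the correct inverted order, which is exactly the overgeneration problem discussed above:

    import math
    from functools import lru_cache

    # Toy grammar: phrase rules translate source spans; two generic rules
    # compose adjacent translations monotonically or inverted.
    PHRASES = {("yu",): ("with", -0.5),
               ("faguo",): ("France", -0.1),
               ("you", "bangjiao"): ("have diplomatic relations", -0.7)}
    MONO, INV = -0.2, -0.9  # log-weights of the two composition rules

    def decode(src):
        @lru_cache(maxsize=None)
        def best(i, j):  # best (score, translation) for span src[i:j]
            cands = []
            if tuple(src[i:j]) in PHRASES:
                t, w = PHRASES[tuple(src[i:j])]
                cands.append((w, t))
            for k in range(i + 1, j):
                (w1, t1), (w2, t2) = best(i, k), best(k, j)
                cands.append((w1 + w2 + MONO, t1 + " " + t2))
                cands.append((w1 + w2 + INV, t2 + " " + t1))
            return max(cands, default=(-math.inf, ""))
        return best(0, len(src))

    # The generic weights pick the monotone (wrong) English order:
    print(decode(("yu", "faguo", "you", "bangjiao")))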


NIST

Appears in 6 sentences as: NIST (8)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. We trained the system on the NIST MT06 Eval corpus excluding the UN data (approximately 900K sentence pairs).
    Page 6, “Experimental Setup”
  2. We used the NIST MT03 test set as the development set for optimizing interpolation weights using minimum error rate training (MERT; Och and Ney, 2002).
    Page 6, “Experimental Setup”
  3. We carried out evaluation of the systems on the NIST 2006 evaluation test (MT06) and the NIST 2008 evaluation test (MT08).
    Page 6, “Experimental Setup”
  4. We trained the system on a subset of 950K sentence pairs from the NIST MT08 training data, selected by subsampling.
    Page 6, “Experimental Setup”
  5. We used the NIST MT03 test set as the development set for optimizing the interpolation weights using MERT.
    Page 6, “Experimental Setup”
  6. We carried out the evaluation of the systems on the NIST 2006 evaluation set (MT06) and the NIST 2008 evaluation set (MT08).
    Page 6, “Experimental Setup”


Statistically significant

Appears in 5 sentences as: statistical significance (1) Statistically significant (2) statistically significant (2)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
    Page 6, “Experimental Setup” (a sketch of this bootstrap test follows this list)
  2. Doubling the number of words (N = 64) produces a small gain, and defining the pairwise dominance model using N = 128 most frequent words produces a statistically significant 1-point gain over the baseline (p < 0.01).
    Page 6, “Experimental Results”
  3. Larger values of N yield statistically significant performance above the baseline, but without further improvements over N = 128.
    Page 6, “Experimental Results”
  4. Statistically significant results (p < 0.01) over the baseline are in bold.
    Page 7, “Experimental Results”
  5. Statistically significant results over the baseline (p < 0.01) are in bold.
    Page 7, “Experimental Results”
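
Sentence 1 refers to Koehn's (2004) paired bootstrap resampling. A minimal sketch, with the corpus-level metric (e.g. BLEU) abstracted into a callback:

    import random

    def paired_bootstrap(metric, hyps_a, hyps_b, refs, trials=1000, seed=0):
        """Fraction of resampled test sets on which system A outscores B;
        A is significantly better at p < 0.01 if this is at least 0.99."""
        rng = random.Random(seed)
        n, wins = len(refs), 0
        for _ in range(trials):
            idx = [rng.randrange(n) for _ in range(n)]  # with replacement
            a = metric([hyps_a[i] for i in idx], [refs[i] for i in idx])
            b = metric([hyps_b[i] for i in idx], [refs[i] for i in idx])
            wins += a > b
        return wins / trials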


BLEU

Appears in 4 sentences as: BLEU (4)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. In all experiments, we report performance using the BLEU score (Papineni et al., 2002), and we assess statistical significance using the standard bootstrapping approach introduced by Koehn (2004).
    Page 6, “Experimental Setup”
  2. These results confirm that the pairwise dominance model can significantly increase performance as measured by the BLEU score, with a consistent pattern of results across the MT06 and MT08 test sets.
    Page 6, “Experimental Results”
  3. When we visually inspect and compare the outputs of our system with those of the baseline, we observe that improved BLEU score often corresponds to visible improvements in the subjective translation quality.
    Page 7, “Discussion and Future Work”
  4. Perhaps surprisingly, translation performance, 30.90 BLEU, was around the level we obtained when using frequency to approximate function words at N = 64.
    Page 8, “Discussion and Future Work”



language model

Appears in 3 sentences as: language model (3)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. Given e and f as the source and target phrases associated with the rule, typical features used are the rule’s translation probability P_trans(f|e) and its inverse P_trans(e|f), and the lexical probability P_lex(f|e) and its inverse P_lex(e|f). Systems generally also employ a word penalty, a phrase penalty, and a target language model feature.
    Page 2, “Hierarchical Phrase-based System”
  2. For the language model, we used a 5-gram model with modified Kneser-Ney smoothing (Kneser and Ney, 1995) trained on the English side of our training data as well as portions of the Gigaword v2 English corpus.
    Page 6, “Experimental Setup” (a query-side sketch follows this list)
  3. For the language model, we used a 5-gram model trained on the English portion of the whole training data plus portions of the Gigaword v2 corpus.
    Page 6, “Experimental Setup”
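
As a query-side sketch: modified Kneser-Ney changes how the probabilities and backoff weights are estimated, not how they are queried, so a backoff lookup against ARPA-style tables (assumed here) suffices to illustrate the feature:

    import math

    def ngram_logprob(context, word, probs, backoffs):
        """Backoff query for an n-gram LM; probs[(context, word)] and
        backoffs[context] hold log10 values, as in an ARPA file."""
        penalty = 0.0
        while True:
            if (context, word) in probs:
                return penalty + probs[(context, word)]
            if not context:
                return penalty + math.log10(1e-7)  # illustrative OOV floor
            penalty += backoffs.get(context, 0.0)
            context = context[1:]  # back off to a shorter history

    # A 5-gram query uses up to four words of history:
    probs = {(("the",), "cat"): -1.2, ((), "cat"): -3.0}
    print(ngram_logprob(("sat", "on", "the"), "cat", probs, {}))  # -1.2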


sentence pairs

Appears in 3 sentences as: sentence pairs (3)
In Topological Ordering of Function Words in Hierarchical Phrase-based Translation
  1. We trained the system on the NIST MT06 Eval corpus excluding the UN data (approximately 900K sentence pairs).
    Page 6, “Experimental Setup”
  2. We trained the system on a subset of 950K sentence pairs from the NIST MT08 training data, selected by subsampling.
    Page 6, “Experimental Setup”
  3. The subsampling algorithm selects sentence pairs from the training data in a way that seeks reasonable representation for all n-grams appearing in the test set.
    Page 6, “Experimental Setup” (one plausible realization is sketched below)
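
One plausible greedy realization of such a subsampler (a hypothetical sketch; the paper does not spell out the algorithm in this excerpt) keeps a sentence pair while any of its test-set n-grams is still under-represented in the selection:

    from collections import Counter

    def ngrams(tokens, max_n=4):
        return {tuple(tokens[i:i + n]) for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)}

    def subsample(bitext, test_ngrams, quota=10):
        """Keep (src, tgt) if some source n-gram also occurs in the test
        set and has been covered fewer than quota times so far."""
        covered, selected = Counter(), []
        for src, tgt in bitext:
            hits = [g for g in ngrams(src) if g in test_ngrams
                    and covered[g] < quota]
            if hits:
                selected.append((src, tgt))
                covered.update(hits)
        return selected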
