Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem

An efficient decoding algorithm is a crucial element of any statistical machine translation system.

Phrase-based systems (Koehn et al., 2003) are probably the most widespread class of Statistical Machine Translation systems, and arguably one of the most successful.

Beam-search decoding

In this paper the Traveling Salesman Problem appears in four variants:

In this section we reformulate the SMT decoding problem as an AGTSP.

5.1 Monolingual word reordering

In section 4.1, we described a general “compiling out” method for extending our TSP representation to handling trigram and N- gram language models, but we noted that the method may lead to combinatorial explosion of the TSP graph.

The main contribution of this paper has been to propose a transformation for an arbitrary phrase-based SMT decoding instance into a TSP instance.

Appears in 14 sentences as: Language Model (2) language model (7) language models (6)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence.Page 1, “Introduction”
- o The language model cost of producing the target words of 19’ right after the target words of b; with a bigram language model , this cost can be precomputed directly from b and b’.Page 4, “Phrase-based Decoding as TSP”
- Successful phrase-based systems typically employ language models of order higher than two.Page 4, “Phrase-based Decoding as TSP”
- If we want to extend the power of the model to general n-gram language models , and in particular to the 3-gramPage 4, “Phrase-based Decoding as TSP”
- If we do that exhaustively for all biphrases (relevant for the source sentence at hand) that, like i, have a single-word target, we will obtain a representation that allows a trigram language model to be computed at each point.Page 5, “Phrase-based Decoding as TSP”
- The problem becomes even worse if we extend the compiling-out method to n-gram language models with n > 3.Page 5, “Phrase-based Decoding as TSP”
- First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.Page 6, “Experiments”
- Then we consider a trigram based Language Model and the algorithms again try to maximize the LM score.Page 6, “Experiments”
- This means that, when using a bigram language model , it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM of the reference.Page 6, “Experiments”
- The experiments with trigram language models (Figures 5(c,d)) show similar trends to those with bigrams.Page 7, “Experiments”
- 5.2 Translation experiments with a bigram language modelPage 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention language model.

See all papers in *Proc. ACL* that mention language model.

Back to top.

Appears in 11 sentences as: BLEU (11) Bleu (1)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- BLEU scorePage 6, “Experiments”
- ond score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.Page 6, “Experiments”
- In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.Page 7, “Experiments”
- Here we see that the exact-TSP outputs are closer to the references in terms of BLEU than the beam-search solutions.Page 7, “Experiments”
- Figure 6 presents Decoder and Bleu scores as functions of time for the two corpuses.Page 7, “Experiments”
- In the case of the Europarl corpus, we observe that LK outperforms Beam-Search in terms of the Decoder score as well as in terms of the BLEU score.Page 7, “Experiments”
- In addition, it is important to note that the BLEU scores obtained in these experiments correspond to feature weights, in the log-linear model (1), that have been optimized for the Moses decoder, but not for the TSP decoder: optimizing these parameters relatively to the TSP decoder could improve its BLEU scores still further.Page 7, “Experiments”
- The situation with the BLEU score is more confuse.Page 7, “Experiments”
- Both algorithms do not show any clear score improvement with increasing running time which suggests that the decoder’s objective function is not very well correlated with the BLEU score on this corpus.Page 7, “Experiments”
- BLEU scorePage 8, “Future Work”
- BLEU score 0 iv A (A)Page 8, “Conclusion”

See all papers in *Proc. ACL 2009* that mention BLEU.

See all papers in *Proc. ACL* that mention BLEU.

Back to top.

Appears in 11 sentences as: LM (12)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- 4.1 From Bigram to N-gram LMPage 4, “Phrase-based Decoding as TSP”
- imizing the LM score over all possible permutations.Page 6, “Experiments”
- Usually the LM score is used as one component of a more complex decoder score which also includes biphrase and distortion scores.Page 6, “Experiments”
- But in this particular “translation task” from bad to good English, we consider that all “biphrases” are of the form 6 — e, where e is an English word, and we do not take into account any distortion: we only consider the quality of the permutation as it is measured by the LM component.Page 6, “Experiments”
- Since the decoding phase is then equivalent to a word reordering, the LM score may be used to compare the performance of different decoding algorithms.Page 6, “Experiments”
- The first one is just the “internal” LM score; since all three algorithms attempt to maximize this score, a natural evaluation procedure is to plot its value versus the elapsed time.Page 6, “Experiments”
- The training dataset for learning the LM consists of 50000 sentences from NewsCommentary corpus (Callison-Burch et al., 2008), the test dataset for word reordering consists of 170 sentences, the average length of test sentences is equal to 17 words.Page 6, “Experiments”
- First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.Page 6, “Experiments”
- Then we consider a trigram based Language Model and the algorithms again try to maximize the LM score.Page 6, “Experiments”
- This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM of the reference.Page 6, “Experiments”
- Although the TSP output does not recover the reference sentences (it produces sentences with a slightly higher LM score than the references), it does reconstruct the references better than the beam-search.Page 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention LM.

See all papers in *Proc. ACL* that mention LM.

Back to top.

Appears in 9 sentences as: BLEU score (7) BLEU scores (2) Bleu scores (1)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- BLEU scorePage 6, “Experiments”
- In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences.Page 7, “Experiments”
- Figure 6 presents Decoder and Bleu scores as functions of time for the two corpuses.Page 7, “Experiments”
- In the case of the Europarl corpus, we observe that LK outperforms Beam-Search in terms of the Decoder score as well as in terms of the BLEU score .Page 7, “Experiments”
- In addition, it is important to note that the BLEU scores obtained in these experiments correspond to feature weights, in the log-linear model (1), that have been optimized for the Moses decoder, but not for the TSP decoder: optimizing these parameters relatively to the TSP decoder could improve its BLEU scores still further.Page 7, “Experiments”
- The situation with the BLEU score is more confuse.Page 7, “Experiments”
- Both algorithms do not show any clear score improvement with increasing running time which suggests that the decoder’s objective function is not very well correlated with the BLEU score on this corpus.Page 7, “Experiments”
- BLEU scorePage 8, “Future Work”
- BLEU score 0 iv A (A)Page 8, “Conclusion”

See all papers in *Proc. ACL 2009* that mention BLEU score.

See all papers in *Proc. ACL* that mention BLEU score.

Back to top.

Appears in 9 sentences as: Phrase-based (1) phrase-based (8)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- In this paper, we focus on the reverse mapping, showing that any phrase-based SMT decoding problem can be directly reformulated as a TSP.Page 1, “Abstract”
- Phrase-based systems (Koehn et al., 2003) are probably the most widespread class of Statistical Machine Translation systems, and arguably one of the most successful.Page 1, “Introduction”
- We will see in the next section that some characteristics of beam-search make it a suboptimal choice for phrase-based decoding, and we will propose an alternative.Page 2, “Introduction”
- This alternative is based on the observation that phrase-based decoding can be very naturally cast as a Traveling Salesman Problem (TSP), one of the best studied problems in combinatorial optimization.Page 2, “Introduction”
- In (Tillmann and Ney, 2003) and (Tillmann, 2006), the authors modify a certain Dynamic Programming technique used for TSP for use with an IBM-4 word-based model and a phrase-based model respectively.Page 2, “Related work”
- with the mainstream phrase-based SMT models, and therefore making it possible to directly apply existing TSP solvers to SMT.Page 3, “Related work”
- As will be shown in the next section, phrase-based SMT decoding can be directly reformulated as an AGTSP.Page 4, “The Traveling Salesman Problem and its variants”
- Successful phrase-based systems typically employ language models of order higher than two.Page 4, “Phrase-based Decoding as TSP”
- The main contribution of this paper has been to propose a transformation for an arbitrary phrase-based SMT decoding instance into a TSP instance.Page 8, “Conclusion”

See all papers in *Proc. ACL 2009* that mention phrase-based.

See all papers in *Proc. ACL* that mention phrase-based.

Back to top.

Appears in 8 sentences as: Bigram (2) bigram (5) bigrams (1)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- o The language model cost of producing the target words of 19’ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b’.Page 4, “Phrase-based Decoding as TSP”
- This restriction to bigram models will be removed in Section 4.1.Page 4, “Phrase-based Decoding as TSP”
- 4.1 From Bigram to N-gram LMPage 4, “Phrase-based Decoding as TSP”
- Bigram based reordering.Page 6, “Experiments”
- First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.Page 6, “Experiments”
- This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM of the reference.Page 6, “Experiments”
- The experiments with trigram language models (Figures 5(c,d)) show similar trends to those with bigrams .Page 7, “Experiments”
- 5.2 Translation experiments with a bigram language modelPage 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention bigram.

See all papers in *Proc. ACL* that mention bigram.

Back to top.

Appears in 6 sentences as: Machine Translation (1) machine translation (5)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- An efficient decoding algorithm is a crucial element of any statistical machine translation system.Page 1, “Abstract”
- Phrase-based systems (Koehn et al., 2003) are probably the most widespread class of Statistical Machine Translation systems, and arguably one of the most successful.Page 1, “Introduction”
- h - mt - z' - 3 this machine translation is strange h - c - t - z' - a this curious translation is automatic ht - s - z' - a this translation strange is automaticPage 4, “Phrase-based Decoding as TSP”
- For example, in the example of Figure 3, the cost of this - machine translation - is - strange, can only take into account the conditional probability of the word strange relative to the word is, but not relative to the words translation and is.Page 4, “Phrase-based Decoding as TSP”
- machine translation .Page 5, “Phrase-based Decoding as TSP”
- machine translation is .Page 5, “Phrase-based Decoding as TSP”

See all papers in *Proc. ACL 2009* that mention machine translation.

See all papers in *Proc. ACL* that mention machine translation.

Back to top.

Appears in 5 sentences as: conditional probability (5)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- They use aligned sequences of words, called biphrases, as building blocks for translations, and score alternative candidate translations for the same source sentence based on a log-linear model of the conditional probability of target sentences given the source sentence:Page 1, “Introduction”
- alignment a, where the alignment is a representation of the sequence of biphrases that where used in order to build T from S; The Ak’s are weights and ZS is a normalization factor that guarantees that p is a proper conditional probability distribution over the pairs (T, A).Page 1, “Introduction”
- These typically include forward and reverse phrase conditional probability features log p(f|§) as well as logp(§ where 59' is the source side of the biphrase and f the target side, and the so-called “phrase penalty” and “word penalty” features, which count the number of phrases and words in the alignment.Page 1, “Introduction”
- its translation has larger conditional probability ) or because the associated heuristics is less tight, hence more optimistic.Page 2, “Related work”
- For example, in the example of Figure 3, the cost of this - machine translation - is - strange, can only take into account the conditional probability of the word strange relative to the word is, but not relative to the words translation and is.Page 4, “Phrase-based Decoding as TSP”

See all papers in *Proc. ACL 2009* that mention conditional probability.

See all papers in *Proc. ACL* that mention conditional probability.

Back to top.

Appears in 5 sentences as: N-gram (2) n-gram (3)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence.Page 1, “Introduction”
- 4.1 From Bigram to N-gram LMPage 4, “Phrase-based Decoding as TSP”
- If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gramPage 4, “Phrase-based Decoding as TSP”
- The problem becomes even worse if we extend the compiling-out method to n-gram language models with n > 3.Page 5, “Phrase-based Decoding as TSP”
- This powerful method, which was proposed in (Kam and Kopec, 1996; Popat et al., 2001) in the context of a finite-state model (but not of TSP), can be easily extended to N-gram situations, and typically converges in a small number of iterations.Page 8, “Future Work”

See all papers in *Proc. ACL 2009* that mention n-gram.

See all papers in *Proc. ACL* that mention n-gram.

Back to top.

Appears in 3 sentences as: objective function (3)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- LK works by generating an initial random feasible solution for the TSP problem, and then repeatedly identifying an ordered subset of k edges in the current tour and an ordered subset of k edges not included in the tour such that when they are swapped the objective function is improved.Page 3, “The Traveling Salesman Problem and its variants”
- Both algorithms do not show any clear score improvement with increasing running time which suggests that the decoder’s objective function is not very well correlated with the BLEU score on this corpus.Page 7, “Experiments”
- SMT comparably or better than the state-of-the-art beam-search strategy, converging on solutions with higher objective function in a shorter time.Page 8, “Conclusion”

See all papers in *Proc. ACL 2009* that mention objective function.

See all papers in *Proc. ACL* that mention objective function.

Back to top.

Appears in 3 sentences as: translation task (1) translation tasks (1) “translation task” (1)

In *Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem*

- But in this particular “translation task” from bad to good English, we consider that all “biphrases” are of the form 6 — e, where e is an English word, and we do not take into account any distortion: we only consider the quality of the permutation as it is measured by the LM component.Page 6, “Experiments”
- In this section we consider two real translation tasks , namely, translation from English to French, trained on Europarl (Koehn et al., 2003) and translation from German to Spanish training on the NewsCommentary corpus.Page 7, “Experiments”
- Since in the real translation task , the size of the TSP graph is much larger than in the artificial reordering task (in our experiments the median size of the TSP graph was around 400 nodes, sometimes growing up to 2000 nodes), directly applying the exact TSP solver would take too long; instead we use the approximate LK algorithm and compare it to Beam-Search.Page 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention translation task.

See all papers in *Proc. ACL* that mention translation task.

Back to top.