Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
Yeniterzi, Reyyan and Oflazer, Kemal

Article Structure

Abstract

We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures.

Introduction

Statistical machine translation into a morphologically complex language such as Turkish, Finnish or Arabic, involves the generation of target words with the proper morphology, in addition to properly ordering the target words.

Syntax-to-Morphology Mapping

In this section, we describe how we map between certain source language syntactic structures and target words with complex morphological structures.

Experimental Setup and Results

3.1 Data Preparation

Experiments with Constituent Reordering

The transformations in the previous section do not perform any constituent level reordering, but rather eliminate certain English function words as tokens in the text and fold them into complex syntactic tags.

Related Work

Statistical Machine Translation into a morphologically rich language is a challenging problem in that, on the target side, the decoder needs to generate both the right sequence of constituents and the right sequence of morphemes for each word.

Conclusions

We have presented a novel way to incorporate source syntactic structure in English-to-Turkish phrase-based machine translation by parsing the source sentences and then encoding many local and nonlocal source syntactic structures as additional complex tag factors.

Topics

BLEU

Appears in 35 sentences as: BLEU (35)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores.
    Page 1, “Abstract”
  2. Our maximal set of source and target side transformations, coupled with some additional techniques, provide a 39% relative improvement from a baseline 17.08 to 23.78 BLEU, all averaged over 10 training and test sets.
    Page 1, “Abstract”
  3. We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets.
    Page 2, “Introduction”
  4. We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores.
    Page 3, “Syntax-to-Morphology Mapping”
  5. For evaluation, we used the BLEU metric (Papineni et al., 2001).
    Page 5, “Experimental Setup and Results”
  6. Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation.
    Page 5, “Experimental Setup and Results”
  7. We can observe that the combined syntax-to-morphology transformations on the source side provide a substantial improvement by themselves and a simple target side transformation on top of those provides a further boost to 21.96 BLEU which represents a 28.57% relative improvement over the word-based baseline and a 18.00% relative improvement over the factored baseline.
    Page 6, “Experimental Setup and Results”
  8. Table 1: BLEU scores for a variety of transformation combinations
    Page 6, “Experimental Setup and Results”
  9. Footnote 15: Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
    Page 6, “Experimental Setup and Results”
  10. While Noun+Adj transformations give us an increase of 2.73 BLEU points, Verbs improve the result by only 0.8 points and improvement with Adverbs is even lower.
    Page 6, “Experimental Setup and Results”
  11. To understand why we get such a difference, we investigated the correlation of the decrease in the number of tokens on both sides of the parallel data, with the change in BLEU scores.
    Page 6, “Experimental Setup and Results”
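The relative-improvement figures quoted in the sentences above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_improvement` is hypothetical, and the only numbers used are the word-based baseline (17.08) and the two system scores (21.96 and 23.78) that appear in the quoted sentences.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Scores taken from the sentences quoted above (word-based baseline: 17.08 BLEU).
    word_baseline = 17.08
    for system_score in (21.96, 23.78):
        gain = relative_improvement(word_baseline, system_score)
        print(f"{word_baseline} -> {system_score}: {gain:.2f}% relative improvement")
```

Relative improvement divides the absolute gain by the baseline, so the same absolute gain in BLEU points is a larger relative gain over a weaker baseline.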

BLEU scores

Appears in 22 sentences as: BLEU score (7) BLEU Scores (2) BLEU scores (13)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores.
    Page 1, “Abstract”
  2. We find that with the full set of syntax-to-morphology transformations and some additional techniques we can get about 39% relative improvement in BLEU scores over a word-based baseline and about 28% improvement of a factored baseline, all experiments being done over 10 training and test sets.
    Page 2, “Introduction”
  3. We find (and elaborate later) that this reduction in the English side of the training corpus, in general, is about 30%, and is correlated with improved BLEU scores.
    Page 3, “Syntax-to-Morphology Mapping”
  4. Wherever meaningful, we report the average BLEU scores over 10 data sets along with the maximum and minimum values and the standard deviation.
    Page 5, “Experimental Setup and Results”
  5. Table 1: BLEU scores for a variety of transformation combinations
    Page 6, “Experimental Setup and Results”
  6. Footnote 15: Note that in this case, the translations would be generated in the same format, but we then split such postpositions from the words they are attached to, during decoding, and then evaluate the BLEU score.
    Page 6, “Experimental Setup and Results”
  7. To understand why we get such a difference, we investigated the correlation of the decrease in the number of tokens on both sides of the parallel data, with the change in BLEU scores.
    Page 6, “Experimental Setup and Results”
  8. The graph in Figure 3 plots the BLEU scores and the number of tokens in the two sides of the training data as the data is modified with transformations.
    Page 6, “Experimental Setup and Results”
  9. We can see that as the number of tokens in English decreases, the BLEU score increases.
    Page 6, “Experimental Setup and Results”
  10. In order to measure the relationship between these two variables statistically, we performed a correlation analysis and found that there is a strong negative correlation of -0.99 between the BLEU score and the number of English tokens.
    Page 6, “Experimental Setup and Results”
  11. We can also note that the largest reduction in the number of tokens comes with the application of the Noun+Adj transformations, which correlates with the largest increase in BLEU score.
    Page 6, “Experimental Setup and Results”
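The correlation analysis mentioned above can be sketched as follows. This is a minimal illustration, assuming a Pearson correlation coefficient (the quoted sentences do not name the statistic). The function name `pearson` and the token counts and scores in the usage section are hypothetical placeholders, not the paper's data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values for illustration only: English token counts after each
    # transformation step, and the BLEU score measured at that step.
    english_tokens = [1000, 900, 850, 800, 750]
    bleu_scores = [17.0, 19.5, 20.5, 21.5, 23.0]
    print(f"r = {pearson(english_tokens, bleu_scores):.3f}")
```

A value close to -1 means the score rises almost linearly as the token count falls, which is the pattern the quoted sentences describe.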

phrase-based

Appears in 11 sentences as: phrase-based (11)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures.
    Page 1, “Abstract”
  2. Once these were identified as separate tokens, they were then used as “words” in a standard phrase-based framework (Koehn et al., 2003).
    Page 1, “Introduction”
  3. This facilitates the use of factored phrase-based translation that was not previously applicable due to the morphological complexity on the target side and mismatch between source and target morphologies.
    Page 2, “Introduction”
  4. We assume that the reader is familiar with the basics of phrase-based statistical machine translation (Koehn et al., 2003) and factored statistical machine translation (Koehn and Hoang, 2007).
    Page 2, “Introduction”
  5. We evaluated the impact of the transformations in factored phrase-based SMT with an English-Turkish data set which consists of 52712 parallel sentences.
    Page 4, “Experimental Setup and Results”
  6. As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
    Page 5, “Experimental Setup and Results”
  7. Factored phrase-based SMT allows the use of multiple language models for the target side, for different factors during decoding.
    Page 6, “Experimental Setup and Results”
  8. Koehn (2005) applied standard phrase-based SMT to Finnish using the Europarl corpus and reported that translation to Finnish had the worst BLEU scores.
    Page 8, “Related Work”
  9. Yang and Kirchhoff (2006) have used phrase-based backoff models to translate unknown words by morphologically decomposing the unknown source words.
    Page 8, “Related Work”
  10. They used both CCG supertags and LTAG supertags in Arabic-to-English phrase-based translation and have reported about 6% relative improvement in BLEU scores.
    Page 9, “Related Work”
  11. We have presented a novel way to incorporate source syntactic structure in English-to-Turkish phrase-based machine translation by parsing the source sentences and then encoding many local and nonlocal source syntactic structures as additional complex tag factors.
    Page 9, “Conclusions”

language models

Appears in 7 sentences as: language model (1) language modeling (1) language models (5)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. The main reason given for these problems was that the same statistical translation, reordering and language modeling mechanisms were being employed to both determine the morphological structure of the words and, at the same time, get the global order of the words correct.
    Page 1, “Introduction”
  2. Furthermore, in factored models, we can employ different language models for different factors.
    Page 5, “Experimental Setup and Results”
  3. We believe that the use of multiple language models (some much less sparse than the surface LM) in the factored baseline is the main reason for the improvement.
    Page 6, “Experimental Setup and Results”
  4. 3.2.3 Experiments with higher-order language models
    Page 6, “Experimental Setup and Results”
  5. Factored phrase-based SMT allows the use of multiple language models for the target side, for different factors during decoding.
    Page 6, “Experimental Setup and Results”
  6. (about 3700) is small compared to distinct number of surface forms (about 52K) and distinct roots (about 15K including numbers), it makes sense to investigate the contribution of higher order n-gram language models for the morphological tag factor on the target side, to see if we can address the observation in the previous section.
    Page 7, “Experimental Setup and Results”
  7. The problem is essentially one of generating multiple candidate sentences with the unattached function words ambiguously positioned (say in a lattice) and then use a second language model to rerank these sentences to select the target sentence.
    Page 9, “Conclusions”

LM

Appears in 7 sentences as: LM (7)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
    Page 5, “Experimental Setup and Results”
  2. We believe that the use of multiple language models (some much less sparse than the surface LM) in the factored baseline is the main reason for the improvement.
    Page 6, “Experimental Setup and Results”
  3. Using a 4-gram root LM, considerably less sparse than word forms but more sparse than tags, we get a BLEU score of 22.80 (max: 24.07, min: 21.57, std: 0.85).
    Page 7, “Experimental Setup and Results”
  4. 3-gram root LM l-gr.
    Page 7, “Experimental Setup and Results”
  5. 4-gram root LM l-gr.
    Page 7, “Experimental Setup and Results”
  6. Table 3: Details of Word, Root and Morphology BLEU Scores, with 8-gram tag LM and 3/4-gram root LMs
    Page 7, “Experimental Setup and Results”
  7. Footnote 16: These experiments were done on top of the model in 3.2.3 with 3-gram word and root LMs and an 8-gram tag LM.
    Page 8, “Experiments with Constituent Reordering”

machine translation

Appears in 7 sentences as: Machine Translation (1) machine translation (7)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. Statistical machine translation into a morphologically complex language such as Turkish, Finnish or Arabic, involves the generation of target words with the proper morphology, in addition to properly ordering the target words.
    Page 1, “Introduction”
  2. We assume that the reader is familiar with the basics of phrase-based statistical machine translation (Koehn et al., 2003) and factored statistical machine translation (Koehn and Hoang, 2007).
    Page 2, “Introduction”
  3. Statistical Machine Translation into a morphologically rich language is a challenging problem in that, on the target side, the decoder needs to generate both the right sequence of constituents and the right sequence of morphemes for each word.
    Page 8, “Related Work”
  4. Using morphology in statistical machine translation has been addressed by many researchers for translation from or into morphologically rich(er) languages.
    Page 8, “Related Work”
  5. Goldwater and McClosky (2005) use morphological analysis on the Czech side to get improvements in Czech-to-English statistical machine translation.
    Page 8, “Related Work”
  6. Recently, Bisazza and Federico (2009) have applied morphological segmentation in Turkish-to-English statistical machine translation and found that it provides nontrivial BLEU
    Page 8, “Related Work”
  7. We have presented a novel way to incorporate source syntactic structure in English-to-Turkish phrase-based machine translation by parsing the source sentences and then encoding many local and nonlocal source syntactic structures as additional complex tag factors.
    Page 9, “Conclusions”

statistical machine translation

Appears in 6 sentences as: Statistical Machine Translation (1) Statistical machine translation (1) statistical machine translation (5)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. Statistical machine translation into a morphologically complex language such as Turkish, Finnish or Arabic, involves the generation of target words with the proper morphology, in addition to properly ordering the target words.
    Page 1, “Introduction”
  2. We assume that the reader is familiar with the basics of phrase-based statistical machine translation (Koehn et al., 2003) and factored statistical machine translation (Koehn and Hoang, 2007).
    Page 2, “Introduction”
  3. Statistical Machine Translation into a morphologically rich language is a challenging problem in that, on the target side, the decoder needs to generate both the right sequence of constituents and the right sequence of morphemes for each word.
    Page 8, “Related Work”
  4. Using morphology in statistical machine translation has been addressed by many researchers for translation from or into morphologically rich(er) languages.
    Page 8, “Related Work”
  5. Goldwater and McClosky (2005) use morphological analysis on the Czech side to get improvements in Czech-to-English statistical machine translation.
    Page 8, “Related Work”
  6. Recently, Bisazza and Federico (2009) have applied morphological segmentation in Turkish-to-English statistical machine translation and found that it provides nontrivial BLEU
    Page 8, “Related Work”

part-of-speech

Appears in 5 sentences as: Part-of-Speech (1) part-of-speech (4)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. They have reported that, given the typical complexity of Turkish words, there was a substantial percentage of words whose morphological structure was incorrect: either the morphemes were not applicable for the part-of-speech category of the root word selected, or the morphemes were in the wrong order.
    Page 1, “Introduction”
  2. Part-of-Speech Tags for the English words: +IN - Preposition; +PRP$ - Possessive Pronoun; +JJ - Adjective; +NN - Noun; +NNS - Plural Noun.
    Page 2, “Syntax-to-Morphology Mapping”
  3. The first marker is the part-of-speech tag of the root and the remainder are the overt inflectional and derivational markers of the word.
    Page 4, “Experimental Setup and Results”
  4. Instead of using just the surface form of the word, we included the root, part-of-speech and morphological tag information into the corpus as additional factors alongside the surface form. Thus, a token is represented with three factors as Surface | Root | Tags, where Tags are complex tags on the English side and morphological tags on the Turkish side.
    Page 5, “Experimental Setup and Results”
  5. Popovic and Ney (2004) investigated improving translation quality from inflected languages by using stems, suffixes and part-of-speech tags.
    Page 8, “Related Work”
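The three-factor token representation (Surface | Root | Tags) described above can be sketched as below. This is an illustration under stated assumptions: the helper names `make_factored_token` and `split_factored_token` are hypothetical, the factors are joined with a bare `|` and no surrounding spaces, and the example token is made up using the +NNS (Plural Noun) tag from the legend above.

```python
from typing import Tuple

FACTOR_SEPARATOR = "|"  # assumption: factors are joined with a bare vertical bar


def make_factored_token(surface: str, root: str, tags: str) -> str:
    """Join the three factors of one token into a single factored token."""
    for factor in (surface, root, tags):
        # A factor must be non-empty and must not contain the separator or spaces,
        # otherwise the token could not be split back unambiguously.
        if not factor or FACTOR_SEPARATOR in factor or any(c.isspace() for c in factor):
            raise ValueError(f"invalid factor: {factor!r}")
    return FACTOR_SEPARATOR.join((surface, root, tags))


def split_factored_token(token: str) -> Tuple[str, str, str]:
    """Split a factored token back into (surface, root, tags)."""
    parts = token.split(FACTOR_SEPARATOR)
    if len(parts) != 3:
        raise ValueError(f"expected 3 factors, got {len(parts)}: {token!r}")
    return parts[0], parts[1], parts[2]


if __name__ == "__main__":
    # Made-up English-side example: surface form, root, and a complex tag.
    token = make_factored_token("books", "book", "+NNS")
    print(token)
    print(split_factored_token(token))
```

Keeping the factors in one token lets a factored system align on one factor (for example the root) while still generating the others.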

baseline system

Appears in 4 sentences as: baseline system (3) Baseline Systems (1)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. 3.2.1 The Baseline Systems
    Page 5, “Experimental Setup and Results”
  2. As a baseline system, we built a standard phrase-based system, using the surface forms of the words without any transformations, and with a 3-gram LM in the decoder.
    Page 5, “Experimental Setup and Results”
  3. We also built a second baseline system with a factored model.
    Page 5, “Experimental Setup and Results”
  4. the baseline system and the highest performance is attained when all transformations are performed.
    Page 6, “Experimental Setup and Results”

morphological analysis

Appears in 4 sentences as: morphological analysis (3) morphological analyzer (1)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes.
    Page 1, “Abstract”
  2. On the Turkish side, we perform a full morphological analysis (Oflazer, 1994), and morphological disambiguation (Yuret and Ture, 2006) to select the contextually salient interpretation of words.
    Page 4, “Experimental Setup and Results”
  3. Footnote 6: For example, the morphological analyzer outputs +A3sg to mark a singular noun, if there is no explicit plural morpheme.
    Page 4, “Experimental Setup and Results”
  4. Goldwater and McClosky (2005) use morphological analysis on the Czech side to get improvements in Czech-to-English statistical machine translation.
    Page 8, “Related Work”

BLEU points

Appears in 3 sentences as: BLEU point (1) BLEU points (2)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. While Noun+Adj transformations give us an increase of 2.73 BLEU points, Verbs improve the result by only 0.8 points and improvement with Adverbs is even lower.
    Page 6, “Experimental Setup and Results”
  2. (2007) have integrated more syntax in a factored translation approach by using CCG supertags as a separate factor and have reported a 0.46 BLEU point improvement in Dutch-to-English translations.
    Page 9, “Related Work”
  3. In the context of reordering, one recent work (Xu et al., 2009), was able to get an improvement of 0.6 BLEU points by using source syntactic analysis and a constituent reordering scheme like ours for English-to-Turkish translation, but without using any morphology.
    Page 9, “Related Work”

dependency relations

Appears in 3 sentences as: dependency relations (3)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. When we tag and syntactically analyze the English side into dependency relations, and morphologically analyze and disambiguate the Turkish phrase, we get the representation in the middle of Figure 1, where we have co-indexed components that should map to each other, and some of the syntactic relations that the function words are involved in are marked with dependency links.
    Page 2, “Syntax-to-Morphology Mapping”
  2. Here <X>, <Y> and <Z> can be considered as Prolog-like variables that bind to patterns (mostly root words), and the conditions check for specified dependency relations (e.g., PMOD) between the left and the right sides.
    Page 3, “Syntax-to-Morphology Mapping”
  3. The sentence representations in the middle part of Figure 2 show these sentences with some of the dependency relations (relevant to our transformations) extracted by the parser, explicitly marked as labeled links.
    Page 4, “Experimental Setup and Results”

language pair

Appears in 3 sentences as: language pair (3)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures.
    Page 1, “Abstract”
  2. Footnote 12: The experience with MERT for this language pair has not been very positive.
    Page 5, “Experimental Setup and Results”
  3. In order to alleviate the lack of large scale parallel corpora for the English-Turkish language pair, we experimented with augmenting the training data with reliable phrase pairs obtained from a previous alignment.
    Page 7, “Experimental Setup and Results”

Part-of-Speech Tags

Appears in 3 sentences as: part-of-speech tag (1) Part-of-Speech Tags (1) part-of-speech tags (1)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. Part-of-Speech Tags for the English words: +IN - Preposition; +PRP$ - Possessive Pronoun; +JJ - Adjective; +NN - Noun; +NNS - Plural Noun.
    Page 2, “Syntax-to-Morphology Mapping”
  2. The first marker is the part-of-speech tag of the root and the remainder are the overt inflectional and derivational markers of the word.
    Page 4, “Experimental Setup and Results”
  3. Popovic and Ney (2004) investigated improving translation quality from inflected languages by using stems, suffixes and part-of-speech tags.
    Page 8, “Related Work”

phrase table

Appears in 3 sentences as: Phrase table (1) phrase table (2)
In Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish
  1. Phrase table entries for the surface factors produced by Moses after it does an alignment on the roots, contain the English (e) and Turkish (t) parts of a pair of aligned phrases, and the probabilities, p(e|t), the conditional probability that the English phrase is e given that the Turkish phrase is t, and p(t|e), the conditional probability that the Turkish phrase is t given the English phrase is e.
    Page 7, “Experimental Setup and Results”
  2. Among these phrase table entries, those with p(e|t) ≈ p(t|e) and p(t|e) + p(e|t) larger than some threshold, can be considered as reliable mutual translations, in that they mostly translate to each other and not much to others.
    Page 7, “Experimental Setup and Results”
  3. from the phrase table those phrases with 0.9 ≤ p(e|t)/p(t|e) ≤ 1.1 and p(t|e) + p(e|t) ≥ 1.5 and added them to the training data to further bias the alignment process.
    Page 8, “Experimental Setup and Results”
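The selection of reliable phrase pairs described above can be sketched as a simple filter. This is an illustration rather than the authors' implementation. The thresholds are read from the quoted sentence as a ratio p(e|t)/p(t|e) between 0.9 and 1.1 and a sum p(t|e) + p(e|t) of at least 1.5. The function names, the tuple layout of an entry, and the sample entries are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical entry layout: (english_phrase, turkish_phrase, p(e|t), p(t|e)).
PhraseEntry = Tuple[str, str, float, float]


def is_reliable_pair(
    p_e_given_t: float,
    p_t_given_e: float,
    ratio_low: float = 0.9,
    ratio_high: float = 1.1,
    min_sum: float = 1.5,
) -> bool:
    """True if the two conditional probabilities are close to each other and both high."""
    if p_t_given_e <= 0.0:
        return False  # the ratio is undefined, so the pair cannot qualify
    ratio = p_e_given_t / p_t_given_e
    return ratio_low <= ratio <= ratio_high and (p_e_given_t + p_t_given_e) >= min_sum


def select_reliable(entries: Iterable[PhraseEntry]) -> List[Tuple[str, str]]:
    """Return the (english, turkish) pairs that pass the reliability filter."""
    return [(e, t) for e, t, p_et, p_te in entries if is_reliable_pair(p_et, p_te)]


if __name__ == "__main__":
    # Placeholder entries for illustration only.
    table = [
        ("phrase_a", "ifade_a", 0.80, 0.80),  # balanced and high: kept
        ("phrase_b", "ifade_b", 0.90, 0.50),  # unbalanced ratio: dropped
        ("phrase_c", "ifade_c", 0.70, 0.70),  # balanced but sum below 1.5: dropped
    ]
    print(select_reliable(table))
```

The ratio test keeps pairs whose two translation directions agree, and the sum test keeps only pairs where both probabilities are high, which matches the description of phrases that "mostly translate to each other and not much to others".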
