Machine Translation without Words through Substring Alignment
Neubig, Graham and Watanabe, Taro and Mori, Shinsuke and Kawahara, Tatsuya

Article Structure

Abstract

In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings.

Introduction

Traditionally, the task of statistical machine translation (SMT) is defined as translating a source sentence...

Related Work on Data Sparsity in SMT

As traditional SMT systems treat all words as single tokens without considering their internal structure, major problems of data sparsity occur for less frequent tokens.

Alignment Methods

SMT systems are generally constructed from a parallel corpus consisting of target language sentences $\mathcal{E}$ and source language sentences $\mathcal{F}$.

LookAhead Biparsing

In this work, we experiment with the alignment method of Neubig et al.

Substring Prior Probabilities

While the Bayesian phrasal ITG framework uses the previously mentioned phrase distribution $P_t$ during search, it also allows for definition of a phrase pair prior probability $P_{prior}(e_s^t, f_u^v)$, which can efficiently seed the search process with a bias towards phrase pairs that satisfy certain properties.

Experiments

In order to test the effectiveness of character-based translation, we performed experiments over a variety of language pairs and experimental settings.

Conclusion and Future Directions

This paper demonstrated that character-based translation can act as a unified framework for handling difficult problems in translation: morphology, compound words, transliteration, and segmentation.

Topics

co-occurrence

Appears in 13 sentences as: Co-occurrence (1) co-occurrence (13)
In Machine Translation without Words through Substring Alignment
  1. This method is attractive, as it is theoretically able to handle all sparsity phenomena in a single unified framework, but has only been shown feasible between similar language pairs such as Spanish-Catalan (Vilar et al., 2007), Swedish-Norwegian (Tiedemann, 2009), and Thai-Lao (Sornlertlamvanich et al., 2008), which have a strong co-occurrence between single characters.
    Page 1, “Introduction”
  2. In this section, we overview an existing method used to calculate these prior probabilities, and also propose a new way to calculate priors based on substring co-occurrence statistics.
    Page 5, “Substring Prior Probabilities”
  3. 5.2 Substring Co-occurrence Priors
    Page 5, “Substring Prior Probabilities”
  4. Instead, we propose a method for using raw substring co-occurrence statistics to bias alignments towards substrings that often co-occur in the entire training corpus.
    Page 5, “Substring Prior Probabilities”
  5. This is similar to the method of Cromieres (2006), but instead of using these co-occurrence statistics as a heuristic alignment criterion, we incorporate them as a prior probability in a statistical model that can take into account mutual exclusivity of overlapping substrings in a sentence.
    Page 5, “Substring Prior Probabilities”
  6. While suffix arrays allow for efficient calculation of these statistics, storing all co-occurrence counts C(e, f) is an unrealistic memory burden for larger corpora.
    Page 5, “Substring Prior Probabilities”
  7. This has a dual effect of reducing the amount of memory needed to hold co-occurrence counts by removing values for which C(e, f) < d, as well as preventing over-fitting of the training data.
    Page 6, “Substring Prior Probabilities”
  8. To determine how to combine $C(e)$, $C(f)$, and $C(e, f)$ into prior probabilities, we performed preliminary experiments testing methods proposed by previous research including plain co-occurrence counts, the Dice coefficient, and $\chi^2$ statistics (Cromieres, 2006), as well as a new method of defining substring pair probabilities to be proportional to bidirectional conditional probabilities.
    Page 6, “Substring Prior Probabilities”
  9. As the prior is only supposed to bias the model towards good solutions and not explicitly rule out any possibilities, we linearly interpolate the co-occurrence probability with the one-to-many Model 1 probability, which will give at least some probability mass to all substring pairs
    Page 6, “Substring Prior Probabilities”
  10. In this section, we compare the translation accuracies for character-based translation using the phrasal ITG model with and without the proposed improvements of substring co-occurrence priors and lookahead parsing as described in Sections 4 and 5.2.
    Page 8, “Experiments”
  11. Table 5: METEOR scores for alignment with and without lookahead and co-occurrence priors.
    Page 8, “Experiments”
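
The substring co-occurrence prior described in sentences 4-9 above can be made concrete with a small sketch. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation: all function and variable names are hypothetical, the nested substring loops stand in for the paper's suffix arrays, and the exact discounting convention is assumed. It counts substring pair co-occurrences over a toy parallel corpus, applies the count discount $d$, prunes pairs whose conditional probabilities fall below 0.1, and normalizes the bidirectional conditional probabilities into a prior.

```python
from collections import Counter

def substrings(sentence, max_len=3):
    """All character substrings of a sentence, up to max_len characters."""
    return {sentence[i:j]
            for i in range(len(sentence))
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1)}

def cooc_prior(corpus, d=1.0, min_cond=0.1):
    """Bidirectional conditional-probability prior over substring pairs.

    corpus   : list of (target, source) sentence pairs
    d        : discount subtracted from C(e, f); non-positive results are
               dropped, which both saves memory and limits over-fitting
    min_cond : prune pairs where P(e|f) or P(f|e) falls below this value
    """
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in corpus:
        e_subs, f_subs = substrings(e_sent), substrings(f_sent)
        c_e.update(e_subs)
        c_f.update(f_subs)
        c_ef.update((e, f) for e in e_subs for f in f_subs)

    scores = {}
    for (e, f), c in c_ef.items():
        c -= d                                # discount the raw count
        if c <= 0:
            continue
        p_ef, p_fe = c / c_f[f], c / c_e[e]   # P(e|f) and P(f|e)
        if p_ef < min_cond or p_fe < min_cond:
            continue                          # heuristic pruning
        scores[(e, f)] = p_ef * p_fe

    z = sum(scores.values())                  # normalizer Z
    return {pair: score / z for pair, score in scores.items()}

prior = cooc_prior([("an apple", "una manzana"), ("an ant", "una hormiga")])
```

In the paper, the counts come from the full training corpus via suffix arrays, and the resulting prior is then linearly interpolated with the one-to-many Model 1 probability (sentence 9) so that no substring pair receives zero probability mass.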

language pairs

Appears in 10 sentences as: language pair (1) language pairs (10)
In Machine Translation without Words through Substring Alignment
  1. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.
    Page 1, “Abstract”
  2. This method is attractive, as it is theoretically able to handle all sparsity phenomena in a single unified framework, but has only been shown feasible between similar language pairs such as Spanish-Catalan (Vilar et al., 2007), Swedish-Norwegian (Tiedemann, 2009), and Thai-Lao (Somlertlamvanich et al., 2008), which have a strong co-occurrence between single characters.
    Page 1, “Introduction”
  3. As Vilar et al. (2007) state and we confirm, accurate translations cannot be achieved when applying traditional translation techniques to character-based translation for less similar language pairs.
    Page 1, “Introduction”
  4. An evaluation on four language pairs with differing morphological properties shows that for distant language pairs, character-based SMT can achieve translation accuracy comparable to word-based systems.
    Page 2, “Introduction”
  5. However, while the approach is attractive conceptually, previous research has only been shown effective for closely related language pairs (Vilar et al., 2007; Tiedemann, 2009; Sornlertlamvanich et al., 2008).
    Page 3, “Related Work on Data Sparsity in SMT”
  6. In this work, we propose effective alignment techniques that allow character-based translation to achieve accurate translation results for both close and distant language pairs.
    Page 3, “Related Work on Data Sparsity in SMT”
  7. In order to test the effectiveness of character-based translation, we performed experiments over a variety of language pairs and experimental settings.
    Page 6, “Experiments”
  8. As previous research has shown that it is more difficult to translate into morphologically rich languages than into English (Koehn, 2005), we perform experiments translating in both directions for all language pairs.
    Page 7, “Experiments”
  9. This confirms that character-based translation is performing well on languages that have long words or ambiguous boundaries, and less well on language pairs with relatively strong one-to-one correspondence between words.
    Page 8, “Experiments”
  10. Table 3 shows that the results are comparable, with no significant difference in average scores for either language pair.
    Page 8, “Experiments”

alignment models

Appears in 5 sentences as: alignment model (2) alignment models (3)
In Machine Translation without Words through Substring Alignment
  1. One barrier to applying many-to-many alignment models to character strings is training cost.
    Page 2, “Introduction”
  2. Secondly, we describe a method to seed the search process using counts of all substring pairs in the corpus to bias the phrase alignment model.
    Page 2, “Introduction”
  3. Sparsity causes trouble for alignment models, both in the form of incorrectly aligned uncommon words, and in the form of garbage collection, where uncommon words in one language are incorrectly aligned to large segments of the sentence in the other language (Och and Ney, 2003).
    Page 2, “Related Work on Data Sparsity in SMT”
  4. These may be words in word-based alignment models or single characters in character-based alignment models. We define our alignment as $a_1^K$, where each element is a span $a_k = (s, t, u, v)$ indicating that the target string $e_s, \ldots, e_t$ and the source string $f_u, \ldots, f_v$ are aligned.
    Page 3, “Alignment Methods”
  5. The most well-known and widely-used models for bitext alignment are for one-to-many alignment, including the IBM models (Brown et al., 1993) and HMM alignment model (Vogel et al., 1996).
    Page 3, “Alignment Methods”
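
Sentence 4 above defines the alignment as a sequence of spans $a_k = (s, t, u, v)$. As a concrete illustration (a minimal sketch with hypothetical names, not the paper's code), such a span simply ties a target-side substring to a source-side substring:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """One alignment element a_k = (s, t, u, v): target positions s..t
    align to source positions u..v (0-indexed, inclusive)."""
    s: int
    t: int
    u: int
    v: int

def aligned_strings(span, target, source):
    """Return the (target, source) substring pair covered by a span."""
    return target[span.s:span.t + 1], source[span.u:span.v + 1]

# In character-based alignment the "tokens" are single characters:
print(aligned_strings(Span(0, 4, 0, 6), "hello!", "bonjour!"))
# -> ('hello', 'bonjour')
```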

BLEU

Appears in 5 sentences as: BLEU (7)
In Machine Translation without Words through Substring Alignment
  1. Minimum error rate training was performed to maximize word-based BLEU score for all systems. For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney.
    Page 7, “Experiments”
  2. We evaluate translation quality using BLEU score (Papineni et al., 2002), both on the word and character level (with n = 4), as well as METEOR (Denkowski and Lavie, 2011) on the word level.
    Page 7, “Experiments”
  3. When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU.
    Page 7, “Experiments”
  4. These are given partial credit by character-based BLEU (and to a lesser extent METEOR), but marked entirely wrong by word-based BLEU.
    Page 7, “Experiments”
  5. Similar results were found for character- and word-based BLEU, but are omitted for lack of space.
    Page 8, “Conclusion and Future Directions”
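
The contrast between word-level and character-level BLEU in the sentences above is purely a matter of how the output is tokenized before n-gram matching. Below is a sketch using NLTK's corpus_bleu (an assumed tool choice; the paper does not name its evaluation scripts): with n = 4, character-level scoring matches character 4-grams, which is why near-miss outputs such as a wrong inflection still earn partial credit.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu(refs, hyps, char_level=False):
    """Corpus BLEU with n = 4, at the word or the character level."""
    tok = (lambda s: list(s)) if char_level else (lambda s: s.split())
    references = [[tok(r)] for r in refs]    # one reference per hypothesis
    hypotheses = [tok(h) for h in hyps]
    smooth = SmoothingFunction().method1     # avoid zeros on tiny examples
    return corpus_bleu(references, hypotheses, smoothing_function=smooth)

refs = ["the cat sat on the mat"]
hyps = ["the cats sat on the mat"]
print(bleu(refs, hyps))                   # word level: "cats" counts as wrong
print(bleu(refs, hyps, char_level=True))  # character level: partial credit
```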

conditional probabilities

Appears in 4 sentences as: conditional probabilities (2) conditional probability (2)
In Machine Translation without Words through Substring Alignment
  1. These models are by nature directional, attempting to find the alignments that maximize the conditional probability of the target sentence $P(e_1^I \mid f_1^J, a_1^K)$. For computational reasons, the IBM models are restricted to aligning each word on the target side to a single word on the source side.
    Page 3, “Alignment Methods”
  2. In addition, we heuristically prune values for which the conditional probabilities $P(e \mid f)$ or $P(f \mid e)$ are less than some fixed value, which we set to 0.1 for the reported experiments.
    Page 6, “Substring Prior Probabilities”
  3. To determine how to combine $C(e)$, $C(f)$, and $C(e, f)$ into prior probabilities, we performed preliminary experiments testing methods proposed by previous research including plain co-occurrence counts, the Dice coefficient, and $\chi^2$ statistics (Cromieres, 2006), as well as a new method of defining substring pair probabilities to be proportional to bidirectional conditional probabilities.
    Page 6, “Substring Prior Probabilities”
  4. $Z = \sum_{\{e,f;\, C(e,f) > d\}} P_{cooc}(e \mid f)\, P_{cooc}(f \mid e)$. The experiments showed that the bidirectional conditional probability method gave significantly better results than all other methods, so we adopt this for the remainder of our experiments.
    Page 6, “Substring Prior Probabilities”
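
Reading sentence 4 together with the counts in sentence 3, the prior being described appears to be the following (a hedged reconstruction in the paper's notation; the count-based definitions of the conditionals are a standard assumption, not shown in the snippets):

```latex
% Bidirectional conditional probability prior (reconstruction):
P_{cooc}(e, f) = \frac{P_{cooc}(e \mid f)\, P_{cooc}(f \mid e)}{Z},
\qquad
Z = \sum_{\{e, f;\, C(e, f) > d\}} P_{cooc}(e \mid f)\, P_{cooc}(f \mid e)
% with the conditionals presumably estimated from substring counts:
%   P_{cooc}(e | f) = C(e, f) / C(f),   P_{cooc}(f | e) = C(e, f) / C(e)
```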

machine translation

Appears in 4 sentences as: machine translation (4)
In Machine Translation without Words through Substring Alignment
  1. In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings.
    Page 1, “Abstract”
  2. Traditionally, the task of statistical machine translation (SMT) is defined as translating a source sentence...
    Page 1, “Introduction”
  3. boundaries, all machine translation systems perform at least some precursory form of tokenization, splitting punctuation and words to prevent the sparsity that would occur if punctuated and non-punctuated words were treated as different entities.
    Page 1, “Introduction”
  4. In this paper, we propose improvements to the alignment process tailored to character-based machine translation, and demonstrate that it is, in fact, possible to achieve translation accuracies that approach...
    Page 1, “Introduction”

morphological analysis

Appears in 4 sentences as: morphological analysis (2) morphological analyzers (2)
In Machine Translation without Words through Substring Alignment
  1. A myriad of methods have been proposed to handle each of these phenomena individually, including morphological analysis, stemming, compound breaking, number regularization, optimizing word segmentation, and transliteration, which we outline in more detail in Section 2.
    Page 1, “Introduction”
  2. Previous works have attempted to handle morphology, decompounding and regularization through lemmatization, morphological analysis, or unsupervised techniques (Nießen and Ney, 2000; Brown, 2002; Lee, 2004; Goldwater and McClosky, 2005; Talbot and Osborne, 2006; Mermer and Akin, 2010; Macherey et al., 2011).
    Page 2, “Related Work on Data Sparsity in SMT”
  3. unified framework, requiring no language-specific tools such as morphological analyzers or word segmenters.
    Page 3, “Related Work on Data Sparsity in SMT”
  4. morphological analyzers to normalize or split the sentence into morpheme streams (Corston-Oliver and Gamon, 2004).
    Page 3, “Alignment Methods”

data sparsity

Appears in 3 sentences as: data sparsity (3)
In Machine Translation without Words through Substring Alignment
  1. As traditional SMT systems treat all words as single tokens without considering their internal structure, major problems of data sparsity occur for less frequent tokens.
    Page 2, “Related Work on Data Sparsity in SMT”
  2. Another source of data sparsity that occurs in all languages is proper names, which have been handled by using cognates or transliteration to improve translation (Knight and Graehl, 1998; Kondrak et al., 2003; Finch and Sumita, 2007), and more sophisticated methods for named entity translation that combine translation and transliteration have also been proposed (Al-Onaizan and Knight, 2002).
    Page 2, “Related Work on Data Sparsity in SMT”
  3. We have enumerated these related works to demonstrate the myriad of data sparsity problems and proposed solutions.
    Page 2, “Related Work on Data Sparsity in SMT”

LM

Appears in 3 sentences as: LM (3)
In Machine Translation without Words through Substring Alignment
  1. TM (en) 2.80M 3.10M 2.77M 2.13M; TM (other) 2.56M 2.23M 3.05M 2.34M; LM (en) 16.0M 15.5M 13.8M 11.5M
    Page 6, “Experiments”
  2. LM (other) 15.3M 11.3M 15.6M 11.9M
    Page 6, “Experiments”
  3. Table 1: The number of words in each corpus for TM and LM training, tuning, and testing.
    Page 6, “Experiments”

phrase pair

Appears in 3 sentences as: phrase pair (2) phrase pairs (2)
In Machine Translation without Words through Substring Alignment
  1. Phrasal ITGs are ITGs that allow for non-terminals that can emit phrase pairs with multiple elements on both the source and target sides.
    Page 3, “Alignment Methods”
  2. This probability is the combination of the generative probability of each phrase pair $P_t(e_s^t, f_u^v)$, as well as the sum of the probabilities over all shorter spans in straight and inverted order.
    Page 4, “LookAhead Biparsing”
  3. While the Bayesian phrasal ITG framework uses the previously mentioned phrase distribution $P_t$ during search, it also allows for definition of a phrase pair prior probability $P_{prior}(e_s^t, f_u^v)$, which can efficiently seed the search process with a bias towards phrase pairs that satisfy certain properties.
    Page 5, “Substring Prior Probabilities”
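
Sentence 2 above (from "LookAhead Biparsing") describes the inside probability of a phrasal ITG span. A hedged reconstruction following the usual phrasal ITG formulation rather than the paper's exact equation: a span is either generated directly as a phrase pair, or built from two smaller spans combined in straight (str) or inverted (inv) order, with $P_x$ denoting the (assumed) rule probabilities.

```latex
I(\langle s, t, u, v \rangle) = P_t(e_s^t, f_u^v)
  + \sum_{s \le S \le t,\; u \le U \le v} \Big[
        P_x(\mathrm{str})\, I(\langle s, S, u, U \rangle)\, I(\langle S, t, U, v \rangle)
      + P_x(\mathrm{inv})\, I(\langle s, S, U, v \rangle)\, I(\langle S, t, u, U \rangle)
    \Big]
```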
