Lattice Desegmentation for Statistical Machine Translation

Morphological segmentation is an effective sparsity reduction strategy for statistical machine translation (SMT) involving morphologically complex languages.

Morphological segmentation is considered to be indispensable when translating between English and morphologically complex languages such as Arabic.

Translating into morphologically complex languages is a challenging and interesting task that has received much recent attention.

Our goal in this work is to benefit from the sparsity-reducing properties of morphological segmentation while simultaneously allowing the system to reason about the final surface forms of the target language.

We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data.

Tables 1 and 2 report results averaged over 5 tuning replications on English-to-Arabic and English-to-Finnish, respectively.

We have explored deeper integration of morphological desegmentation into the statistical machine translation pipeline.

Appears in 16 sentences as: LM (20)

In *Lattice Desegmentation for Statistical Machine Translation*

- Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM). (Page 1, "Abstract")
- In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n−1) tokens. (Page 5, "Methods")
- If this property does not hold, then nodes must be split until it does.[4] This property is maintained by the decoder's recombination rules for the segmented LM, but it is not guaranteed for the desegmented LM. (Page 5, "Methods")
- Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM. (Page 5, "Methods")
- Fortunately, LM annotation as well as any necessary lattice modifications can be performed simultaneously by composing the desegmented lattice with a finite state acceptor encoding the LM (Roark et al., 2011). (Page 5, "Methods")
- We compose this acceptor with a desegmenting transducer, and then with an unsegmented LM acceptor, producing a fully annotated, desegmented lattice. (Page 5, "Methods")
- Instead of using a toolkit such as OpenFst (Allauzen et al., 2007), we implement both the desegmenting transducer and the LM acceptor programmatically. (Page 5, "Methods")
- [4] Or the LM composition can be done dynamically, effectively decoding the lattice with a beam or cube-pruned search (Huang and Chiang, 2007). (Page 5, "Methods")
- Programmatic LM Integration (Page 6, "Methods")
- Programmatic composition of a lattice with an n-gram LM acceptor is a well-understood problem. (Page 6, "Methods")
- With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward. (Page 6, "Methods")
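The node-splitting requirement quoted above can be met implicitly by pairing each lattice node with its LM context during composition. The following is a minimal Python sketch, not the paper's implementation: it assumes a bigram LM supplied as a `logprob(prev, tok)` function and a lattice encoded as (src, dst, token) edge triples, both of which are illustrative conventions.

```python
from collections import defaultdict

def compose_with_bigram_lm(edges, start, logprob):
    """Compose a lattice with a bigram LM acceptor.

    edges: list of (src, dst, token) triples; start: initial node.
    Each output node is a (node, context) pair, so every node carries a
    single LM context and each outgoing edge gets an exact LM score.
    Nodes whose incoming paths disagree on the last token are thereby
    split automatically.
    """
    by_src = defaultdict(list)
    for src, dst, tok in edges:
        by_src[src].append((dst, tok))

    out_edges = []
    agenda = [(start, "<s>")]
    seen = {(start, "<s>")}
    while agenda:
        node, ctx = agenda.pop()
        for dst, tok in by_src[node]:
            # annotate the edge with the LM score for this exact context
            out_edges.append(((node, ctx), (dst, tok), tok, logprob(ctx, tok)))
            if (dst, tok) not in seen:
                seen.add((dst, tok))
                agenda.append((dst, tok))
    return out_edges
```

For example, when two edges labelled "a" and "b" both enter node 1, the composition splits it into (1, "a") and (1, "b"), restoring the property that every path into a node ends with the same (n−1) tokens.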

See all papers in *Proc. ACL 2014* that mention LM.


Appears in 13 sentences as: language model (10) language modeling (1) language models (2)


- Our novel lattice desegmentation algorithm effectively combines both segmented and desegmented views of the target language for a large subspace of possible translation outputs, which allows for inclusion of features related to the desegmentation process, as well as an unsegmented language model (LM). (Page 1, "Abstract")
- Bojar (2007) incorporates such analyses into a factored model, to either include a language model over target morphological tags, or model the generation of morphological features. (Page 2, "Related Work")
- They introduce an additional desegmentation technique that augments the table-based approach with an unsegmented language model. (Page 2, "Related Work")
- Oflazer and Durgar El-Kahlout (2007) desegment 1000-best lists for English-to-Turkish translation to enable scoring with an unsegmented language model. (Page 2, "Related Work")
- Unlike our work, they replace the segmented language model with the unsegmented one, allowing them to tune the linear model parameters by hand. (Page 2, "Related Work")
- We use both segmented and unsegmented language models, and tune automatically to optimize BLEU. (Page 2, "Related Work")
- (2010) tune on unsegmented references,[1] and translate with both segmented and unsegmented language models for English-to-Finnish translation. (Page 2, "Related Work")
- In this setting, the sparsity reduction from segmentation helps word alignment and target language modeling, but it does not result in a more expressive translation model. (Page 3, "Related Work")
- This trivially allows for an unsegmented language model and never makes desegmentation errors. (Page 3, "Methods")
- Doing so enables the inclusion of an unsegmented target language model, and with a small amount of bookkeeping, it also allows the inclusion of features related to the operations performed during desegmentation (see Section 3.4). (Page 3, "Methods")
- We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model. (Page 5, "Methods")


Appears in 13 sentences as: BLEU (17)


- We use both segmented and unsegmented language models, and tune automatically to optimize BLEU. (Page 2, "Related Work")
- (2008) also tune on unsegmented references by simply desegmenting SMT output before MERT collects sufficient statistics for BLEU. (Page 2, "Related Work")
- This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). (Page 3, "Methods")
- We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). (Page 7, "Experimental Setup")
- For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. (Page 8, "Results")
- Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. (Page 8, "Results")
- 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. (Page 8, "Results")
- We also tried a similar Morfessor-based segmentation for Arabic, which has an unsegmented test set BLEU of 32.7. (Page 8, "Results")
- As in Finnish, the 1-best desegmentation using Morfessor did not surpass the unsegmented baseline, producing a test BLEU of only 31.4 (not shown in Table 1). (Page 8, "Results")
- Table fragment, header and first row: Model | Dev BLEU | Test BLEU | Test TER; Unsegmented | 15.4 | 15.1 | 70.… (Page 8, "Results")


Appears in 5 sentences as: translation model (4) translation models (1)


- Morphological complexity leads to much higher type-to-token ratios than English, which can create sparsity problems during translation model estimation. (Page 1, "Introduction")
- Most techniques approach the problem by transforming the target language in some manner before training the translation model. (Page 1, "Related Work")
- In this setting, the sparsity reduction from segmentation helps word alignment and target language modeling, but it does not result in a more expressive translation model. (Page 3, "Related Work")
- Four translation model features encode phrase translation probabilities and lexical scores in both directions. (Page 7, "Experimental Setup")
- Eventually, we would like to replace the functionality of factored translation models (Koehn and Hoang, 2007) with lattice transformation and augmentation. (Page 9, "Conclusion")


Appears in 5 sentences as: significant improvement (1) significant improvements (4)


- We investigate this technique in the context of English-to-Arabic and English-to-Finnish translation, showing significant improvements in translation quality over desegmentation of 1-best decoder outputs. (Page 1, "Abstract")
- We demonstrate that significant improvements in translation quality can be achieved by training a linear model to re-rank this transformed translation space. (Page 1, "Introduction")
- In fact, even with our lattice desegmenter providing a boost, we are unable to see a significant improvement over the unsegmented model. (Page 8, "Results")
- Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER. (Page 8, "Results")
- We have also applied our approach to English-to-Finnish translation, and although segmentation in general does not currently help, we are able to show significant improvements over a 1-best desegmentation baseline. (Page 9, "Conclusion")


Appears in 4 sentences as: BLEU point (5)


- For English-to-Arabic, 1-best desegmentation results in a 0.7 BLEU point improvement over training on unsegmented Arabic. (Page 8, "Results")
- Moving to lattice desegmentation more than doubles that improvement, resulting in a BLEU score of 34.4 and an improvement of 1.0 BLEU point over 1-best desegmentation. (Page 8, "Results")
- 1000-best desegmentation also works well, resulting in a 0.6 BLEU point improvement over 1-best. (Page 8, "Results")
- When applied to English-to-Arabic translation, lattice desegmentation results in a 1.0 BLEU point improvement over one-best desegmentation, and a 1.7 BLEU point improvement over unsegmented translation. (Page 9, "Conclusion")


Appears in 4 sentences as: translation quality (4)


- We investigate this technique in the context of English-to-Arabic and English-to-Finnish translation, showing significant improvements in translation quality over desegmentation of 1-best decoder outputs. (Page 1, "Abstract")
- We demonstrate that significant improvements in translation quality can be achieved by training a linear model to re-rank this transformed translation space. (Page 1, "Introduction")
- This could improve translation quality, as it brings our training scenario closer to our test scenario (test BLEU is always measured on unsegmented references). (Page 3, "Methods")
- [7] We also experimented with log p(X|Y) as an additional feature, but observed no improvement in translation quality. (Page 6, "Experimental Setup")


Appears in 4 sentences as: Morphological Analysis (1) morphological analysis (2) morphological analyzers (1)


- The transformation might take the form of a morphological analysis or a morphological segmentation. (Page 1, "Related Work")
- 2.1 Morphological Analysis (Page 2, "Related Work")
- Many languages have access to morphological analyzers, which annotate surface forms with their lemmas and morphological features. (Page 2, "Related Work")
- In the future, we plan to explore introducing multiple segmentation options into the lattice, and the application of our method to a full morphological analysis (as opposed to segmentation) of the target language. (Page 9, "Conclusion")


Appears in 4 sentences as: TER (4)


- We evaluate our system using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006). (Page 7, "Experimental Setup")
- Nonetheless, the 1000-best and lattice desegmenters both produce significant improvements over the 1-best desegmentation baseline, with Lattice Deseg achieving a 1-point improvement in TER. (Page 8, "Results")
- Table fragment, header and first row: Model | Dev BLEU | Test BLEU | Test TER; Unsegmented | 15.4 | 15.1 | 70.… (Page 8, "Results")


Appears in 4 sentences as: segmentations (3) segmenter’s (1)


- For many segmentations, especially unsupervised ones, this amounts to simple concatenation. (Page 2, "Related Work")
- However, more complex segmentations, such as the Arabic tokenization provided by MADA (Habash et al., 2009), require further orthographic adjustments to reverse normalizations performed during segmentation. (Page 2, "Related Work")
- where [prefix], [stem] and [suffix] are non-overlapping sets of morphemes, whose members are easily determined using the segmenter's segment boundary markers.[3] The second disjunct of Equation 1 covers words that have no clear stem, such as the Arabic له lh "for him", segmented as l+ "for" +h "him". (Page 4, "Methods")
- To generate the desegmentation table, we analyze the segmentations from the Arabic side of the parallel training data to collect mappings from morpheme sequences to surface forms. (Page 7, "Experimental Setup")
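The table-plus-concatenation scheme described in these excerpts can be sketched in a few lines. This is an illustrative Python sketch, not the paper's code: the function names and the "+" boundary-marker convention are assumptions (the paper's actual table is built from MADA-segmented Arabic).

```python
from collections import Counter, defaultdict

def build_deseg_table(observations):
    """observations: (morpheme_sequence, surface_form) pairs collected by
    comparing the segmented and original target sides of the training data.
    Keeps the most frequent surface form for each morpheme sequence."""
    counts = defaultdict(Counter)
    for morphs, surface in observations:
        counts[tuple(morphs)][surface] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}

def desegment(morphs, table):
    """Look the morpheme sequence up in the table; fall back to simple
    concatenation (stripping '+' boundary markers) when it is unseen."""
    key = tuple(morphs)
    if key in table:
        return table[key]
    return "".join(m.strip("+") for m in morphs)
```

The table lookup handles orthographic adjustments that plain concatenation would miss, while the fallback covers unsupervised-style segmentations where concatenation suffices.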


Appears in 3 sentences as: word-level (4)


- In this section, we discuss how a lattice from a multi-stack phrase-based decoder such as Moses (Koehn et al., 2007) can be desegmented to enable word-level features. (Page 3, "Methods")
- We now have a desegmented lattice, but it has not been annotated with an unsegmented (word-level) language model. (Page 5, "Methods")
- Indeed, the expanded word-level context is one of the main benefits of incorporating a word-level LM. (Page 5, "Methods")


Appears in 3 sentences as: SMT system (2) SMT system’s (1)


- Other approaches train an SMT system to predict lemmas instead of surface forms, and then inflect the SMT output as a postprocessing step (Minkov et al., 2007; Clifton and Sarkar, 2011; Fraser et al., 2012; El Kholy and Habash, 2012b). (Page 2, "Related Work")
- We approach this problem by augmenting an SMT system built over target segments with features that reflect the desegmented target words. (Page 3, "Methods")
- In this section, we describe our various strategies for desegmenting the SMT system's output space, along with the features that we add to take advantage of this desegmented view. (Page 3, "Methods")


Appears in 3 sentences as: regular expression (2) regular expression: (1)


- We define a word using the following regular expression: (Page 4, "Methods")
- Equation 1 may need to be modified for other languages or segmentation schemes, but our techniques generalize to any definition that can be written as a regular expression. (Page 4, "Methods")
- A chain is valid if it emits the beginning of a word as defined by the regular expression in Equation 1. (Page 5, "Methods")
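A token-level check in the spirit of Equation 1 can be sketched as follows. The "+" marker convention (prefixes end in "+", suffixes begin with "+") matches the l+ / +h example quoted elsewhere on this page; the exact form of the second disjunct is an assumption based on the "no clear stem" case, so treat this as a sketch rather than the paper's definition.

```python
import re

def classify(tok):
    """Map a morpheme token to a class tag via its boundary markers."""
    if tok.endswith("+") and not tok.startswith("+"):
        return "p"  # prefix, e.g. "l+"
    if tok.startswith("+"):
        return "s"  # suffix, e.g. "+h"
    return "t"      # stem (no marker)

def is_word(tokens):
    """Check a morpheme sequence against a regular-expression word
    definition modelled on Equation 1:
        prefix* stem suffix*  |  prefix+ suffix+   (stemless words)"""
    signature = "".join(classify(t) for t in tokens)
    return re.fullmatch(r"p*ts*|p+s+", signature) is not None
```

Reducing each token to a one-letter class tag lets the word definition stay a plain regular expression over the tag string, which is what makes the finite-state treatment above possible.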


Appears in 3 sentences as: phrase-based (3)


- In this section, we discuss how a lattice from a multi-stack phrase-based decoder such as Moses (Koehn et al., 2007) can be desegmented to enable word-level features. (Page 3, "Methods")
- A phrase-based decoder produces its output from left to right, with each operation appending the translation of a source phrase to a growing target hypothesis. (Page 3, "Methods")
- The search graph of a phrase-based decoder can be interpreted as a lattice, which can be interpreted as a finite state acceptor over target strings. (Page 4, "Methods")


Appears in 3 sentences as: phrase table (2) phrase tables (1)


- Alternatively, one can reparameterize existing phrase tables as exponential models, so that translation probabilities account for source context and morphological features (Jeong et al., 2010; Subotin, 2011). (Page 2, "Related Work")
- This is sometimes referred to as a word graph (Ueffing et al., 2002), although in our case the segmented phrase table also produces tokens that correspond to morphemes. (Page 4, "Methods")
- (2010) address this problem by forcing the decoder's phrase table to respect word boundaries, guaranteeing that each desegmentable token sequence is local to an edge. (Page 5, "Methods")
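Under the "+" boundary-marker convention used elsewhere on this page, the word-boundary constraint on phrases amounts to a simple filter. This is a hypothetical sketch of that idea, not the cited authors' implementation:

```python
def respects_word_boundaries(phrase):
    """True iff no word straddles the phrase's edges: a phrase must not
    begin mid-word (with a suffix morpheme, '+x') or end mid-word (with
    a prefix morpheme, 'x+').  phrase: list of morpheme tokens."""
    return not phrase[0].startswith("+") and not phrase[-1].endswith("+")
```

A phrase table filtered this way guarantees that every desegmentable token sequence is contained within a single edge of the resulting lattice.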


Appears in 3 sentences as: parallel data (3)


- With word-boundary-aware phrase extraction, a phrase pair containing all of "with his blue car" must have been seen in the parallel data to translate the phrase correctly at test time. (Page 3, "Related Work")
- We align the parallel data with GIZA++ (Och et al., 2003) and decode using Moses (Koehn et al., 2007). (Page 7, "Experimental Setup")
- A KN-smoothed 5-gram language model is trained on the target side of the parallel data with SRILM (Stolcke, 2002). (Page 7, "Experimental Setup")


Appears in 3 sentences as: NIST (4)


- We train our English-to-Arabic system using 1.49 million sentence pairs drawn from the NIST 2012 training set, excluding the UN data. (Page 6, "Experimental Setup")
- We tune on the NIST 2004 evaluation set (1353 sentences) and evaluate on NIST 2005 (1056 sentences). (Page 7, "Experimental Setup")
- Judging from the output on the NIST 2005 test set, the system uses these discontiguous desegmentations very rarely: only 5% of desegmented tokens align to discontiguous source phrases. (Page 8, "Results")


Appears in 3 sentences as: n-gram (3)


- In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n−1) tokens. (Page 5, "Methods")
- Programmatic composition of a lattice with an n-gram LM acceptor is a well-understood problem. (Page 6, "Methods")
- With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward. (Page 6, "Methods")


Appears in 3 sentences as: log-linear model (3)


- The decoder's log-linear model includes a standard feature set. (Page 7, "Experimental Setup")
- The decoder's log-linear model is tuned with MERT (Och, 2003). (Page 7, "Experimental Setup")
- Both the decoder's log-linear model and the re-ranking models are trained on the same development set. (Page 7, "Experimental Setup")
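The re-ranking models mentioned here score each hypothesis with a linear combination of its features. A minimal sketch (the feature names below are invented for illustration; the paper's actual feature set is described in its Section 3.4):

```python
def loglinear_score(features, weights):
    """Score of a hypothesis under a log-linear model: the dot product
    of its (sparse) feature vector with the tuned weight vector."""
    return sum(weights.get(name, 0.0) * val for name, val in features.items())

# Hypothetical feature values and weights, for illustration only.
hyp = {"segmented_lm": -42.1, "word_lm": -37.9, "deseg_ops": 3.0}
wts = {"segmented_lm": 0.5, "word_lm": 0.8, "deseg_ops": -0.1}
score = loglinear_score(hyp, wts)
```

MERT tunes the weight vector to optimize BLEU on the development set; the scoring itself is just this dot product.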


Appears in 3 sentences as: log-linear (3)


- The decoder's log-linear model includes a standard feature set. (Page 7, "Experimental Setup")
- The decoder's log-linear model is tuned with MERT (Och, 2003). (Page 7, "Experimental Setup")
- Both the decoder's log-linear model and the re-ranking models are trained on the same development set. (Page 7, "Experimental Setup")
