Accurate Word Segmentation using Transliteration and Language Model Projection
Hagiwara, Masato and Sekine, Satoshi

Article Structure

Abstract

Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use.

Introduction

Accurate word segmentation (WS) is one of the key components in successful language processing.

Related Work

In Japanese WS, unknown words are usually dealt with in an online manner with the unknown word model, which uses heuristics

Word Segmentation Model

Our baseline model is a semi-Markov structure prediction model which estimates WS and the PoS sequence simultaneously (Kudo et al., 2004; Zhang and Clark, 2008).
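To make the model concrete, here is a minimal sketch of semi-Markov Viterbi decoding over a word lattice that jointly chooses a segmentation and a PoS sequence. It is an illustration under assumptions, not the authors' implementation: `dictionary`, `score`, and `max_len` are hypothetical stand-ins for the paper's dictionary lookup and feature-weighted scoring.

```python
from collections import defaultdict

def viterbi_segment(sent, dictionary, score, max_len=8):
    """Semi-Markov Viterbi: jointly choose a segmentation and PoS sequence.

    sent       -- unsegmented input string
    dictionary -- maps a substring to the set of PoS tags it may take (hypothetical)
    score      -- hypothetical function score(prev_state, word, pos) returning
                  a feature-weighted edge score (stands in for w . phi)
    """
    n = len(sent)
    # best[i] maps the (word, pos) state ending at position i to (score, backpointer)
    best = [defaultdict(lambda: (float("-inf"), None)) for _ in range(n + 1)]
    best[0][("<BOS>", "BOS")] = (0.0, None)
    for i in range(n):
        for prev, (s, _) in list(best[i].items()):
            if s == float("-inf"):
                continue
            for j in range(i + 1, min(i + max_len, n) + 1):
                word = sent[i:j]
                for pos in dictionary.get(word, ()):
                    cand = s + score(prev, word, pos)
                    if cand > best[j][(word, pos)][0]:
                        best[j][(word, pos)] = (cand, (i, prev))
    # trace backpointers from the best state covering the whole sentence
    # (assumes the lattice admits at least one full analysis)
    state = max(best[n], key=lambda k: best[n][k][0])
    i, path = n, []
    while state != ("<BOS>", "BOS"):
        path.append(state)
        i, state = best[i][state][1]
    return list(reversed(path))
```

In the actual model, the edge score would be the dot product of a learned weight vector with the unigram and bigram features described later.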

Use of Language Model

Language Model Augmentation: Analogous to Koehn and Knight (2003), we can exploit the fact that レッド reddo (red) in the example ブラッキッシュレッド burakkisshureddo is such a common word that one can expect it to appear frequently in the training corpus.

Transliteration

For transliterating Japanese/Chinese words back to English, we adopted the Joint Source Channel (JSC) model (Li et al., 2004), a generative model widely used as a simple yet powerful baseline in previous research, e.g., (Hagiwara and Sekine, 2012; Finch and Sumita, 2010). The JSC model, given an input of source word $s$ and target word $t$, defines the transliteration probability based on transliteration units (TUs) $u_i = \langle s_i, t_i \rangle$ as: $P_{JSC}(\langle s, t \rangle) = \prod_{i=1}^{f} P(u_i \mid u_{i-n+1}, \dots, u_{i-1})$, where $f$ is the number of TUs in a given source/target word pair.
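To make the formula concrete, the following sketch scores a fixed TU sequence under a bigram TU model (n = 2). It is an assumption-laden illustration, not the authors' code: the example TU pairs and the probability table are hypothetical, and the search over possible TU segmentations of a word pair is omitted.

```python
import math

def jsc_log_prob(tu_seq, tu_bigram_logp, unseen_logp=math.log(1e-8)):
    """log P_JSC(<s, t>) = sum_{i=1}^{f} log P(u_i | u_{i-1}) for n = 2.

    tu_seq         -- transliteration units u_i = (s_i, t_i), e.g.
                      [("bu", "ブ"), ("ra", "ラ"), ("kki", "ッキ")]  (hypothetical)
    tu_bigram_logp -- dict mapping (u_{i-1}, u_i) to a log probability
    """
    logp, prev = 0.0, ("<s>", "<s>")  # beginning-of-word TU
    for tu in tu_seq:
        # crude floor for unseen TU bigrams; real models use proper smoothing
        logp += tu_bigram_logp.get((prev, tu), unseen_logp)
        prev = tu
    return logp
```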

Experiments

6.1 Experimental Settings

Conclusion and Future Works

In this paper, we proposed a novel, online WS model for the Japanese/Chinese compound word splitting problem, seamlessly incorporating the knowledge that the back-transliterations of properly segmented words also appear in an English LM.

Topics

LM

Appears in 19 sentences as: LM (22)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. We propose an online approach, integrating a source LM and/or back-transliteration and an English LM.
    Page 1, “Abstract”
  2. We refer to this process of transliterating unknown words into another language and using the target LM as LM projection.
    Page 1, “Introduction”
  3. Since the model employs a general transliteration model and a general English LM, it achieves robust WS for unknown words.
    Page 1, “Introduction”
  4. Figure 1: Example lattice with LM projection
    Page 3, “Word Segmentation Model”
  5. As we mentioned in Section 2, English LM knowledge helps split transliterated compounds.
    Page 3, “Use of Language Model”
  6. We use language model (LM) projection, which is a combination of back-transliteration and an English model, by extending the normal lattice building process as follows:
    Page 3, “Use of Language Model”
  7. Then, edges are spanned between these extended English nodes, instead of between the original nodes, by additionally taking into consideration English LM features (IDs 21 and 22 in Table 1): $\phi_1^{LMP}(w_i) = \log p(w_i)$ and $\phi_2^{LMP}(w_{i-1}, w_i) = \log p(w_{i-1}, w_i)$ (see the feature sketch after this list).
    Page 3, “Use of Language Model”
  8. As the English LM, we used Google Web 1T 5-gram Version 1 (Brants and Franz, 2006), limiting it to unigrams occurring more than 2000 times and bigrams occurring more than 500 times (see the filtering sketch after this list).
    Page 3, “Use of Language Model”
  9. We observed slight improvement by incorporating the source LM, and observed a 0.48-point F-value increase over the baseline, which translates to a 4.65-point Katakana F-value change and a 16.0% (3.56% to 2.99%) WER reduction, mainly due to its higher Katakana word rate (11.2%).
    Page 4, “Experiments”
  10. This type of error is reduced by +LM-P, e.g., *プラス チック *purasu chikku "*plus tick" to プラスチック purasuchikku "plastic", due to LM projection.
    Page 4, “Experiments”
  11. …performance, which may be because one cannot limit where the source LM features are applied.
    Page 5, “Experiments”
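The feature sketch referenced in item 7: a minimal illustration of computing the two LM projection features (IDs 21 and 22) on an edge between extended English nodes. The `english` node attribute and the `english_lm` probability table are assumptions, standing in for the back-transliterated surface forms and the filtered Web 1T LM.

```python
from collections import namedtuple
import math

Node = namedtuple("Node", "surface english")  # hypothetical lattice node

def lm_projection_features(prev_node, node, english_lm, floor=1e-9):
    """Return phi_1^{LMP} and phi_2^{LMP} for one lattice edge.

    english_lm -- maps word tuples to probabilities (hypothetical), e.g.
                  {("blackish",): 1e-5, ("blackish", "red"): 1e-6}
    """
    p_uni = english_lm.get((node.english,), floor)
    p_bi = english_lm.get((prev_node.english, node.english), floor)
    # phi_1^{LMP}(w_i) = log p(w_i); phi_2^{LMP}(w_{i-1}, w_i) = log p(w_{i-1}, w_i)
    return {21: math.log(p_uni), 22: math.log(p_bi)}

# e.g. lm_projection_features(Node("ブラッキッシュ", "blackish"),
#                             Node("レッド", "red"),
#                             {("blackish",): 1e-5, ("blackish", "red"): 1e-6})
```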
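The filtering sketch referenced in item 8: a plausible way to apply the stated frequency cutoffs (unigrams occurring more than 2000 times, bigrams more than 500). The tab-separated "ngram<TAB>count" layout matches the Web 1T distribution files; the paths here are hypothetical.

```python
def filter_ngrams(path, min_count):
    """Keep n-grams whose count exceeds min_count.

    Assumes one 'ngram<TAB>count' entry per line, as in the Web 1T files.
    """
    kept = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").split("\t")
            if int(count) > min_count:
                kept[tuple(ngram.split())] = int(count)
    return kept

unigrams = filter_ngrams("web1t/unigrams.txt", 2000)  # hypothetical paths
bigrams = filter_ngrams("web1t/bigrams.txt", 500)
```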

PoS tags

Appears in 4 sentences as: PoS tag (1) PoS tags (4)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. Here, $w_i$ and $w_{i-1}$ denote the current and previous word in question, and $t_i^j$ and $t_{i-1}^j$ are level-$j$ PoS tags assigned to them (see the sketch after this list).
    Page 2, “Word Segmentation Model”
  2. The Japanese dictionary and the corpus we used have 6 levels of PoS tag hierarchy, while the Chinese ones have only one level, which is why some of the PoS features are not included in Chinese.
    Page 2, “Word Segmentation Model”
  3. Since the dictionary is not explicitly annotated with PoS tags, we firstly took the intersection of the training corpus and the dictionary words, and assigned all the possible PoS tags to the words which appeared in the corpus.
    Page 4, “Experiments”
  4. Proper noun performance for the Stanford segmenter is not shown since it does not assign PoS tags.
    Page 5, “Experiments”
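A small sketch of what level-$j$ PoS features could look like, given the 6-level hierarchy mentioned in item 2. The tag strings and feature naming are hypothetical, not the paper's exact feature templates.

```python
def pos_level_features(w_prev, w_cur, tags_prev, tags_cur):
    """Emit unigram and bigram features for every PoS hierarchy level j.

    tags_prev, tags_cur -- PoS tag paths from coarse to fine, e.g.
                           ["noun", "common", "general"]  (hypothetical tags)
    """
    feats = []
    for j, (tp, tc) in enumerate(zip(tags_prev, tags_cur), start=1):
        feats.append(("uni", j, w_cur, tc))             # pairs w_i with t_i^j
        feats.append(("bi", j, w_prev, tp, w_cur, tc))  # pairs with t_{i-1}^j too
    return feats
```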

bigram

Appears in 3 sentences as: bigram (2) bigrams (1)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. Knowing that the back-transliterated unigram "blacki" and bigram "blacki shred" are unlikely in English can promote the correct WS, ブラッキッシュ / レッド "blackish red".
    Page 1, “Introduction”
  2. We limit the features to word unigram and bigram features, i.e., $\phi(y) = \sum_i [\phi_1(w_i) + \phi_2(w_{i-1}, w_i)]$ for $y = w_1 \dots w_n$ (see the sketch after this list).
    Page 2, “Word Segmentation Model”
  3. As the English LM, we used Google Web 1T 5-gram Version 1 (Brants and Franz, 2006), limiting it to unigrams occurring more than 2000 times and bigrams occurring more than 500 times.
    Page 3, “Use of Language Model”
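A minimal sketch of the feature decomposition in item 2: the global feature vector of a candidate segmentation $y$ is the sum of unigram features $\phi_1$ and bigram features $\phi_2$ over its words. The feature names here are illustrative only.

```python
from collections import Counter

def phi(words):
    """phi(y) = sum_i [ phi_1(w_i) + phi_2(w_{i-1}, w_i) ] for y = w_1 ... w_n."""
    feats = Counter()
    prev = "<BOS>"
    for w in words:
        feats[("uni", w)] += 1        # phi_1(w_i)
        feats[("bi", prev, w)] += 1   # phi_2(w_{i-1}, w_i)
        prev = w
    return feats

# e.g. phi(["blackish", "red"]) contains ("uni", "blackish"), ("uni", "red"),
# ("bi", "<BOS>", "blackish"), and ("bi", "blackish", "red")
```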

Language Model

Appears in 3 sentences as: Language Model (2) language models (1)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. Our approach is based on semi-Markov discriminative structure prediction, and it incorporates English back-transliteration and English language models (LMs) into WS in a seamless way.
    Page 1, “Introduction”
  2. Language Model Augmentation: Analogous to Koehn and Knight (2003), we can exploit the fact that レッド reddo (red) in the example ブラッキッシュレッド burakkisshureddo is such a common word that one can expect it to appear frequently in the training corpus.
    Page 3, “Use of Language Model”
  3. 4.1 Language Model Projection
    Page 3, “Use of Language Model”

proposed models

Appears in 3 sentences as: proposed models (3)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.
    Page 1, “Abstract”
  2. Table 3 shows the results of the proposed models and major open-source Japanese WS systems, namely, MeCab 0.98 (Kudo et al., 2004), JUMAN 7.0 (Kurohashi and Nagao, 1994), …
    Page 4, “Experiments”
  3. Here, MeCab+UniDic achieved slightly better Katakana WS than the proposed models.
    Page 4, “Experiments”

significant improvement

Appears in 3 sentences as: significant improvement (3)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.
    Page 1, “Abstract”
  2. The results show that we achieved a significant improvement in WS accuracy in both languages.
    Page 1, “Introduction”
  3. The experimental results show that the model achieves a significant improvement over the baseline and LM augmentation, achieving a 16% WER reduction in the EC domain.
    Page 5, “Conclusion and Future Works”

word segmentation

Appears in 3 sentences as: word segmentation (2) word segmenter (1)
In Accurate Word Segmentation using Transliteration and Language Model Projection
  1. Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use.
    Page 1, “Abstract”
  2. Accurate word segmentation (WS) is one of the key components in successful language processing.
    Page 1, “Introduction”
  3. For example, when splitting a compound noun ブラッキッシュレッド burakkisshureddo, a traditional word segmenter can easily segment this as *ブラッキ / シュレッド "*blacki shred" since シュレッド shureddo "shred" is a known, frequent word.
    Page 1, “Introduction”
