A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
Jia, Zhongye and Zhao, Hai

Article Structure

Abstract

It is very important for Chinese language processing to have an efficient input method engine (IME), of which pinyin-to-Chinese (PTC) conversion is the core part.

Introduction

1.1 Chinese Input Method

Related Works

The very first approach to Chinese input with typo correction was made by Chen and Lee (2000), which was also the initial attempt at a “sentence-based” IME.

Pinyin Input Method Model

3.1 From English Letter to Chinese Sentence

Experiments

4.1 Corpora, Tools and Experiment Settings

Conclusion

In this paper, we have developed a joint graph model for pinyin-to-Chinese conversion with typo correction.

Topics

IME

Appears in 33 sentences as: IME (26) IMEs (9)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. It is very important for Chinese language processing to have an efficient input method engine ( IME ), of which pinyin-to-Chinese (PTC) conversion is the core part.
    Page 1, “Abstract”
  2. Meanwhile, though typos are inevitable during user pinyin input, existing IMEs have paid little attention to this inconvenience.
    Page 1, “Abstract”
  3. In this paper, motivated by a key equivalence of two decoding algorithms, we propose a joint graph model to globally optimize PTC and typo correction for IME .
    Page 1, “Abstract”
  4. The evaluation results show that the proposed method outperforms both existing academic and commercial IMEs .
    Page 1, “Abstract”
  5. The daily life of Chinese people heavily depends on the Chinese input method engine ( IME ), no matter whether one is composing an email, writing an article, or sending a text message.
    Page 1, “Introduction”
  6. However, every Chinese word inputted into a computer or cellphone cannot be typed directly through a one-to-one key-to-letter mapping, but has to go through an IME, as there are thousands of Chinese characters to input while only 26 letter keys are available on the keyboard.
    Page 1, “Introduction”
  7. An IME is an essential software interface that maps Chinese characters into English letter combinations.
    Page 1, “Introduction”
  8. An efficient IME will largely improve the user experience of Chinese information processing.
    Page 1, “Introduction”
  9. Nowadays most Chinese IMEs are pinyin based.
    Page 1, “Introduction”
  10. The advantage of pinyin IME is that it only adopts the pronunciation perspective of Chinese characters so that it is simple and easy to learn.
    Page 1, “Introduction”
  11. Modern pinyin IMEs mostly use a “sentence-based” decoding technique (Chen and Lee, 2000) to alleviate the ambiguities.
    Page 1, “Introduction”


graph model

Appears in 11 sentences as: Graph Model (1) Graph model (2) graph model (8)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. In this paper, motivated by a key equivalence of two decoding algorithms, we propose a joint graph model to globally optimize PTC and typo correction for IME.
    Page 1, “Abstract”
  2. Various approaches were proposed for the task, including language model (LM) based methods (Chen et al., 2013), an ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and a graph model (Jia et al., 2013).
    Page 2, “Related Works”
  3. Inspired by Yang et al. (2012b) and Jia et al. (2013), we adopt the graph model for Chinese spell checking to perform pinyin segmentation and typo correction, which is based on the shortest path word segmentation algorithm (Casey and Lecolinet, 1996).
    Page 3, “Pinyin Input Method Model”
  4. Figure 2: Graph model for pinyin segmentation
    Page 4, “Pinyin Input Method Model”
  5. Figure 3: Graph model for pinyin typo correction
    Page 4, “Pinyin Input Method Model”
  6. 3.4 Joint Graph Model For Pinyin IME
    Page 5, “Pinyin Input Method Model”
  7. Given that the HMM decoding problem is identical to the SSSP problem on a DAG, we propose a joint graph model for PTC conversion with typo correction (see the SSSP sketch after this list).
    Page 5, “Pinyin Input Method Model”
  8. The joint graph model aims to find the global optimum for both PTC conversion and typo correction on the entire input pinyin sequence.
    Page 5, “Pinyin Input Method Model”
  9. Figure 4: Joint graph model
    Page 6, “Pinyin Input Method Model”
  10. Figure 6: Filtered graph model
    Page 6, “Pinyin Input Method Model”
  11. In this paper, we have developed a joint graph model for pinyin-to-Chinese conversion with typo correction.
    Page 9, “Conclusion”
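
Items 7 and 8 above rest on the equivalence between HMM decoding and single-source shortest path (SSSP) on a DAG. Below is a minimal sketch of SSSP decoding on a DAG in that spirit, not the authors' code: vertices are relaxed in topological order, and the vertex and edge weights are assumed to be negative log probabilities; the function name sssp_dag and the toy graph are illustrative assumptions.

    import math

    def sssp_dag(topo_vertices, edges, vertex_weight, start, end):
        # edges: {u: [(v, edge_weight), ...]}, vertices given in topological order.
        # Weights stand in for -log probabilities, so the shortest path is the
        # most probable decoding.
        dist = {v: math.inf for v in topo_vertices}
        back = {}
        dist[start] = vertex_weight.get(start, 0.0)
        for u in topo_vertices:                      # relax in topological order
            if dist[u] == math.inf:
                continue
            for v, w in edges.get(u, []):
                cand = dist[u] + w + vertex_weight.get(v, 0.0)
                if cand < dist[v]:
                    dist[v], back[v] = cand, u
        path = [end]                                 # recover the best path
        while path[-1] != start:
            path.append(back[path[-1]])
        return list(reversed(path)), dist[end]

    # Toy usage: v0 -> {a, b} -> vE with arbitrary -log weights.
    edges = {"v0": [("a", 0.25), ("b", 0.75)], "a": [("vE", 0.5)], "b": [("vE", 0.125)]}
    print(sssp_dag(["v0", "a", "b", "vE"], edges, {}, "v0", "vE"))  # (['v0', 'a', 'vE'], 0.75)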


Chinese word

Appears in 10 sentences as: Chinese word (6) Chinese words (5)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. However, every Chinese word inputted into a computer or cellphone cannot be typed directly through a one-to-one key-to-letter mapping, but has to go through an IME, as there are thousands of Chinese characters to input while only 26 letter keys are available on the keyboard.
    Page 1, “Introduction”
  2. Without word delimiters, linguists have long argued about what a Chinese word really is, which is why there is usually a primary word segmentation treatment in most Chinese language processing tasks (Zhao et al., 2006; Huang and Zhao, 2007; Zhao and Kit, 2008; Zhao et al., 2010; Zhao and Kit, 2011; Zhao et al., 2013).
    Page 3, “Pinyin Input Method Model”
  3. A Chinese word may contain from 1 to over 10 characters due to different word segmentation conventions.
    Page 3, “Pinyin Input Method Model”
  4. Nevertheless, pinyin syllable segmentation is a much easier problem compared to Chinese word segmentation.
    Page 3, “Pinyin Input Method Model”
  5. In the HMM for pinyin IME, observation states are pinyin syllables, hidden states are Chinese words , emission probability is P(s_i | w_i), and transition probability is P(w_i | w_{i-1}).
    Page 5, “Pinyin Input Method Model”
  6. PTC conversion is to decode the Chinese word sequence from the pinyin sequence.
    Page 5, “Pinyin Input Method Model”
  7. Corresponding Chinese words are fetched from a PTC dictionary, which maps pinyin words to Chinese words , and added as vertices (see the toy lookup sketch after this list):
    Page 5, “Pinyin Input Method Model”
  8. We will also report the conversion error rate (ConvER) proposed by Zheng et al. (2011a), which is the ratio of the number of mistyped pinyin words that are not converted to the right Chinese word over the total number of mistyped pinyin words.
    Page 7, “Experiments”
  9. According to our empirical observation, emission probabilities are mostly 1 since most Chinese words have a unique pronunciation.
    Page 7, “Experiments”
  10. Zheng et al. (2011a) performed an experiment in which 2,000 sentences of 11,968 Chinese words were entered by 5 native speakers.
    Page 8, “Experiments”
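
Item 7 above fetches candidate Chinese words from a PTC dictionary and adds them as graph vertices. A toy illustration of that lookup follows; the dictionary entries and the helper name are invented for illustration and are not the paper's resource.

    # Hypothetical PTC dictionary: pinyin word -> candidate Chinese words.
    ptc_dict = {
        "bei": ["北", "被", "背"],
        "jing": ["京", "经", "静"],
        "beijing": ["北京", "背景"],
    }

    def candidate_vertices(pinyin_words):
        # For every pinyin word, every candidate Chinese word becomes a vertex
        # (position, word) of the conversion graph.
        return [(i, w) for i, py in enumerate(pinyin_words) for w in ptc_dict.get(py, [])]

    print(candidate_vertices(["bei", "jing"]))
    # [(0, '北'), (0, '被'), (0, '背'), (1, '京'), (1, '经'), (1, '静')]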


baseline system

Appears in 8 sentences as: Baseline System (1) Baseline system (1) baseline system (6)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. 4.3 Baseline System without Typo Correction
    Page 7, “Experiments”
  2. First, we build a baseline system without typo correction, which is a pipeline of pinyin syllable segmentation and PTC conversion.
    Page 7, “Experiments”
  3. The baseline system takes a pinyin input sequence, segments it into syllables, and then converts it into a Chinese character sequence.
    Page 7, “Experiments”
  4. We compare our baseline system with several practical pinyin IMEs including sunpinyin and Google Input Tools (Online version)4.
    Page 7, “Experiments”
  5. Table 2: Baseline system compared to other IMEs (%)
    Page 8, “Experiments”
  6. Based upon the baseline system , we build the joint system of PTC conversion with typo correction.
    Page 8, “Experiments”
  7. Our results are compared to the baseline system without typo correction and Google Input Tool.
    Page 9, “Experiments”
  8. Since sunpinyin does not have a typo correction module and performs much worse than our baseline system , we do not include it in the comparison.
    Page 9, “Experiments”


conditional probability

Appears in 7 sentences as: conditional probabilities (1) conditional probability (6)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. They solved the typo correction problem by decomposing the conditional probability P(H | P) of Chinese character sequence H given pinyin sequence P into a language model P(w_i | w_{i-1}) and a typing model P(s_i | w_i). The typing model, which was estimated on real user input data, was used for typo correction.
    Page 2, “Related Works”
  2. The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{i,j}) that a syllable S_{i,j} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
    Page 3, “Pinyin Input Method Model”
  3. Similar to G_s , the edges are from one syllable to all syllables next to it, and edge weights are the conditional probabilities between them.
    Page 4, “Pinyin Input Method Model”
  4. Thus the conditional probability between characters does not make much sense.
    Page 4, “Pinyin Input Method Model”
  5. In addition, a pinyin syllable usually maps to dozens or even hundreds of corresponding homophonic characters, which makes the conditional probability between syllables much more noisy.
    Page 4, “Pinyin Input Method Model”
  6. The best Chinese character sequence W* for a given pinyin syllable sequence S is the one with the highest conditional probability P(W|S) that
    Page 4, “Pinyin Input Method Model”
  7. Note the transition probability is the conditional probability between words instead of characters (see the scoring sketch after this list).
    Page 5, “Pinyin Input Method Model”
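
Items 5 to 7 describe choosing the word sequence with the highest P(W|S) under a first-order HMM, i.e., scoring through emission probabilities P(s_i | w_i) and word-level transition probabilities P(w_i | w_{i-1}). A minimal scoring sketch in negative log space, with toy probability tables as assumptions (not from the paper):

    import math

    def neg_log_score(words, syllables, emission, transition, start="<s>"):
        # -log [ prod_i P(s_i|w_i) * P(w_i|w_{i-1}) ]; lower is better.
        score, prev = 0.0, start
        for w, s in zip(words, syllables):
            score -= math.log(emission[(s, w)])       # emission   P(s_i | w_i)
            score -= math.log(transition[(prev, w)])  # transition P(w_i | w_{i-1})
            prev = w
        return score

    emission = {("bei", "北"): 1.0, ("jing", "京"): 1.0}
    transition = {("<s>", "北"): 0.6, ("北", "京"): 0.9}
    print(neg_log_score(["北", "京"], ["bei", "jing"], emission, transition))  # ≈ 0.616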


LM

Appears in 7 sentences as: LM (8)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. Various approaches were proposed for the task, including language model ( LM ) based methods (Chen et al., 2013), an ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and a graph model (Jia et al., 2013).
    Page 2, “Related Works”
  2. W_E(v_{i,j} → v_{j+1,k}) = −log P(v_{j+1,k} | v_{i,j}). Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher order n-gram LM , by tracking longer history while traversing the graph.
    Page 6, “Pinyin Input Method Model”
  3. The Chinese part of the corpus is segmented into words before LM training.
    Page 6, “Experiments”
  4. The pinyin syllable segmentation already has very high (over 98%) accuracy with a trigram LM using improved Kneser-Ney smoothing.
    Page 7, “Experiments”
  5. We consider different LM smoothing methods including Kneser-Ney (KN), improved Kneser-Ney (IKN), and Witten-Bell (WB).
    Page 7, “Experiments”
  6. According to the results, we then choose the trigram LM using Kneser-Ney smoothing with interpolation.
    Page 7, “Experiments”
  7. Figure 7: MIU-Acc and Ch-Acc with different LM smoothing
    Page 7, “Experiments”


edge weight

Appears in 6 sentences as: edge weight (3) edge weights (3)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{i,j}) that a syllable S_{i,j} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
    Page 3, “Pinyin Input Method Model”
  2. Similar to G_s , the edges are from one syllable to all syllables next to it, and edge weights are the conditional probabilities between them (see the edge-weight sketch after this list).
    Page 4, “Pinyin Input Method Model”
  3. • Edges from the start vertex, E(v_0 → v_{1,k}), with edge weight
    Page 5, “Pinyin Input Method Model”
  4. with edge weight
    Page 5, “Pinyin Input Method Model”
  5. The shortest path P* from v_0 to v_E is the one with the least sum of vertex and edge weights , i.e.,
    Page 5, “Pinyin Input Method Model”
  6. The edge weights are the negative logarithm of the transition probabilities:
    Page 6, “Pinyin Input Method Model”
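
A tiny numeric illustration of the edge-weight definition in items 1 and 6: the weight of an edge between adjacent syllables is the negative logarithm of their bigram probability. The bigram table and floor value below are toy assumptions, not the paper's smoothed LM.

    import math

    def edge_weight(prev_syl, next_syl, bigram_lm, floor=1e-8):
        # -log P(next syllable | previous syllable); unseen bigrams get a small
        # floor probability here purely for illustration.
        return -math.log(bigram_lm.get((prev_syl, next_syl), floor))

    bigram_lm = {("bei", "jing"): 0.05}            # toy probability
    print(edge_weight("bei", "jing", bigram_lm))   # ≈ 3.00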


joint model

Appears in 6 sentences as: joint model (6)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. As we will propose a joint model
    Page 2, “Related Works”
  2. To make typo correction better, we consider integrating it with PTC conversion using a joint model .
    Page 4, “Pinyin Input Method Model”
  3. Other evaluation metrics are also proposed by Zheng et al. (2011a), which are only suitable for their system since our system uses a joint model
    Page 7, “Experiments”
  4. The selection of K also directly guarantees the running time of the joint model .
    Page 8, “Experiments”
  5. using the proposed joint model are shown in Table 3 and Table 4.
    Page 9, “Experiments”
  6. In addition, the joint model is efficient enough for practical use.
    Page 9, “Conclusion”


word segmentation

Appears in 6 sentences as: word segmentation (6)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. Without word delimiters, linguists have long argued about what a Chinese word really is, which is why there is usually a primary word segmentation treatment in most Chinese language processing tasks (Zhao et al., 2006; Huang and Zhao, 2007; Zhao and Kit, 2008; Zhao et al., 2010; Zhao and Kit, 2011; Zhao et al., 2013).
    Page 3, “Pinyin Input Method Model”
  2. A Chinese word may contain from 1 to over 10 characters due to different word segmentation conventions.
    Page 3, “Pinyin Input Method Model”
  3. Nevertheless, pinyin syllable segmentation is a much easier problem compared to Chinese word segmentation .
    Page 3, “Pinyin Input Method Model”
  4. Inspired by Yang et al. (2012b) and Jia et al. (2013), we adopt the graph model for Chinese spell checking to perform pinyin segmentation and typo correction, which is based on the shortest path word segmentation algorithm (Casey and Lecolinet, 1996).
    Page 3, “Pinyin Input Method Model”
  5. However, using pinyin words instead of syllables is not a wise choice because pinyin word segmentation is not so easy a task as syllable segmentation.
    Page 4, “Pinyin Input Method Model”
  6. Maximum matching word segmentation is used with a large word vocabulary V extracted from web data provided by Wang et al. (2013b) (see the maximum matching sketch after this list).
    Page 6, “Experiments”
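
Item 6 uses maximum matching word segmentation over a large vocabulary. A minimal forward maximum matching sketch follows; the vocabulary, the maximum word length, and the example string are toy assumptions.

    def max_match(text, vocab, max_len=10):
        # Greedy forward maximum matching: at each position take the longest
        # vocabulary word; fall back to a single character if nothing matches.
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in vocab or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    vocab = {"北京", "大学", "北京大学", "生活"}
    print(max_match("北京大学生活", vocab))   # ['北京大学', '生活']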


language model

Appears in 4 sentences as: language model (5)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. They solved the typo correction problem by decomposing the conditional probability P(H | P) of Chinese character sequence H given pinyin sequence P into a language model P(w_i | w_{i-1}) and a typing model P(s_i | w_i). The typing model, which was estimated on real user input data, was used for typo correction.
    Page 2, “Related Works”
  2. Various approaches were proposed for the task, including language model (LM) based methods (Chen et al., 2013), an ME model (Han and Chang, 2013), CRF (Wang et al., 2013d; Wang et al., 2013a), SMT (Chiu et al., 2013; Liu et al., 2013), and a graph model (Jia et al., 2013).
    Page 2, “Related Works”
  3. The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{i,j}) that a syllable S_{i,j} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
    Page 3, “Pinyin Input Method Model”
  4. SRILM (Stolcke, 2002) is adopted for language model training and KenLM (Heafield, 2011; Heafield et al., 2013) for language model query (see the query example after this list).
    Page 6, “Experiments”
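
Item 4 names SRILM for training and KenLM for querying the language model. A hedged usage sketch of the KenLM Python binding is given below; the model path "lm.arpa" and the query string are placeholders, and this is not the paper's code.

    import kenlm  # pip package "kenlm"; assumes an ARPA/binary model trained elsewhere

    model = kenlm.Model("lm.arpa")   # placeholder path to the trained LM
    # Total log10 probability of a whitespace-tokenized sequence with <s>/</s> added.
    print(model.score("bei jing da xue", bos=True, eos=True))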


bigram

Appears in 3 sentences as: bigram (3)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{i,j}) that a syllable S_{i,j} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
    Page 3, “Pinyin Input Method Model”
  2. W_E(v_{i,j} → v_{j+1,k}) = −log P(v_{j+1,k} | v_{i,j}). Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher order n-gram LM, by tracking longer history while traversing the graph (see the history-tracking sketch after this list).
    Page 6, “Pinyin Input Method Model”
  3. All of the three smoothing methods for bigram and trigram LMs are examined using both back-off and interpolated models.
    Page 7, “Experiments”
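
Item 2 notes that moving from a bigram to a higher order n-gram LM only requires tracking a longer history while traversing the graph. A minimal sketch of that state bookkeeping; the helper name and example words are illustrative.

    def extend_history(history, word, order=3):
        # For an n-gram LM the decoding state is the last (order-1) words:
        # one word for a bigram LM, two words for a trigram LM.
        return (history + (word,))[-(order - 1):]

    state = ("<s>",)
    for w in ["北京", "大学"]:
        state = extend_history(state, w, order=3)
    print(state)   # ('北京', '大学'), the context a trigram LM conditions on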


evaluation metrics

Appears in 3 sentences as: Evaluation Metrics (1) evaluation metrics (2)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. 4.2 Evaluation Metrics
    Page 7, “Experiments”
  2. We will use conventional sequence labeling evaluation metrics such as sequence accuracy and character accuracy (see the metric sketch after this list).
    Page 7, “Experiments”
  3. Other evaluation metrics are also proposed by Zheng et al. (2011a), which are only suitable for their system since our system uses a joint model
    Page 7, “Experiments”
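
A minimal sketch of the two conventional metrics named in item 2, using the standard definitions (exact sequence match rate and per-position character match rate); the paper's scorer may differ in detail.

    def sequence_accuracy(predicted, gold):
        # Fraction of sentences converted exactly right.
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    def character_accuracy(predicted, gold):
        # Fraction of characters converted right, position by position.
        correct = sum(pc == gc for p, g in zip(predicted, gold) for pc, gc in zip(p, g))
        return correct / sum(len(g) for g in gold)

    print(sequence_accuracy(["北京大学"], ["北京大学"]))   # 1.0
    print(character_accuracy(["北京人学"], ["北京大学"]))   # 0.75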


Viterbi

Appears in 3 sentences as: Viterbi (5)
In A Joint Graph Model for Pinyin-to-Chinese Conversion with Typo Correction
  1. The idea of “statistical input method” was proposed by modeling PTC conversion as a hidden Markov model (HMM) and using the Viterbi algorithm ( Viterbi , 1967) to decode the sequence.
    Page 2, “Related Works”
  2. The Viterbi algorithm ( Viterbi , 1967) is used for the decoding (see the toy decoding sketch after this list).
    Page 5, “Pinyin Input Method Model”
  3. The shortest path algorithm for typo correction and the Viterbi algorithm for PTC conversion are very closely related.
    Page 5, “Pinyin Input Method Model”
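
A toy Viterbi decoding sketch for the first-order HMM view described in items 1 and 2; candidate lists and probability tables are invented for illustration, and the paper actually solves an equivalent shortest-path problem on a graph.

    import math

    def viterbi(syllables, candidates, emission, transition, start="<s>"):
        # prev_layer maps a word at the previous position to (best -log prob, best path).
        prev_layer = {start: (0.0, [])}
        for s in syllables:
            layer = {}
            for w in candidates[s]:
                cost, path = min(
                    (c - math.log(transition[(p, w)]) - math.log(emission[(s, w)]), pp + [w])
                    for p, (c, pp) in prev_layer.items()
                )
                layer[w] = (cost, path)
            prev_layer = layer
        best_cost, best_path = min(prev_layer.values())
        return best_path, best_cost

    candidates = {"bei": ["北", "被"], "jing": ["京", "景"]}
    emission = {(s, w): 1.0 for s in candidates for w in candidates[s]}
    transition = {("<s>", "北"): 0.6, ("<s>", "被"): 0.4,
                  ("北", "京"): 0.9, ("北", "景"): 0.1,
                  ("被", "京"): 0.2, ("被", "景"): 0.8}
    print(viterbi(["bei", "jing"], candidates, emission, transition))  # (['北', '京'], ...)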
