An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
Kruengkrai, Canasai and Uchimoto, Kiyotaka and Kazama, Jun'ichi and Wang, Yiou and Torisawa, Kentaro and Isahara, Hitoshi

Article Structure

Abstract

In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.

Introduction

In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks.

Background

2.1 Problem formation

Policies for correct path selection

In this section, we describe our strategies for selecting the correct path yt in the training phase.

Training method

4.1 Discriminative online learning

Experiments

5.1 Data sets

Related work

In this section, we discuss related approaches based on several aspects of learning algorithms and search space representation methods.

Conclusion

In this paper, we presented a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.

Topics

POS tagging

Appears in 26 sentences as: (1) POS tag (3) POS tagging (13) POS tags (9)
  1. In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 1, “Abstract”
  2. Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation.
    Page 1, “Introduction”
  3. Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over a pipelined fashion (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
    Page 1, “Introduction”
  4. In joint word segmentation and the POS tagging process, one serious problem is caused by unknown words, which are defined as words that are not found in a training corpus or in a system’s word dictionary.
    Page 1, “Introduction”
  5. The word boundaries and the POS tags of unknown words, which are very difficult to identify, cause numerous errors.
    Page 1, “Introduction”
  6. In joint word segmentation and the POS tagging process, the task is to predict a path
    Page 2, “Background”
  7. p is its POS tag, and a "| · |" symbol denotes the number of elements in each variable.
    Page 2, “Background”
  8. Word-level nodes, which correspond to words found in the system’s word dictionary, have regular POS tags.
    Page 2, “Background”
  9. Character-level nodes have special tags where position-of-character (POC) and POS tags are combined (Asahara, 2003; Nakagawa, 2004).
    Page 2, “Background”
  10. We can directly estimate the statistics of known words from an annotated corpus where a sentence is already segmented into words and assigned POS tags.
    Page 3, “Policies for correct path selection”
  11. We consider a word and its POS tag a single entry.
    Page 3, “Policies for correct path selection”

word segmentation

Appears in 14 sentences as: (1) Word segmentation (2) word segmentation (12)
  1. In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 1, “Abstract”
  2. In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks.
    Page 1, “Introduction”
  3. Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation.
    Page 1, “Introduction”
  4. Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over a pipelined fashion (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
    Page 1, “Introduction”
  5. In joint word segmentation and the POS tagging process, one serious problem is caused by unknown words, which are defined as words that are not found in a training corpus or in a system’s word dictionary.
    Page 1, “Introduction”
  6. In joint word segmentation and the POS tagging process, the task is to predict a path
    Page 2, “Background”
  7. In our experiments, the optimal threshold value r is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
    Page 3, “Policies for correct path selection”
  8. Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
    Page 6, “Experiments”
  9. We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag).
    Page 6, “Experiments”
  10. We compared our results with those of Jiang et al. (2008a; 2008b) on CTB 5.0 and Zhang and Clark (2008) on CTB 4.0 since they reported the best performances on joint word segmentation and POS tagging using training materials derived only from the corpora.
    Page 7, “Experiments”
  11. Maximum entropy models are widely used for word segmentation and POS tagging tasks (Uchimoto et al., 2001; Ng and Low, 2004; Nakagawa, 2004; Nakagawa and Uchimoto, 2007) since they only need moderate training times while they provide reasonable performance.
    Page 8, “Related work”

word-level

Appears in 12 sentences as: Word-level (1) word-level (11)
  1. In the hybrid model, given an input sentence, a lattice that consists of word-level and character-level nodes is constructed.
    Page 2, “Background”
  2. Word-level nodes, which correspond to words found in the system’s word dictionary, have regular POS tags.
    Page 2, “Background”
  3. In other words, we use word-level nodes to identify known words and character-level nodes to identify unknown words.
    Page 2, “Background”
  4. Ideally, we need to build a word-character hybrid model that effectively learns the characteristics of unknown words (with character-level nodes) as well as those of known words (with word-level nodes).
    Page 3, “Policies for correct path selection”
  5. If we select the correct path yt that corresponds to the annotated sentence, it will only consist of word-level nodes that do not allow learning for unknown words.
    Page 3, “Policies for correct path selection”
  6. We therefore need to choose character-level nodes as correct nodes instead of word-level nodes for some words.
    Page 3, “Policies for correct path selection”
  7. As a result, the correct path yt can contain both word-level and character-level nodes (marked with asterisks (*)).
    Page 3, “Policies for correct path selection”
  8. W0 <w0>, W1 <p0>: templates for word-level nodes
    Page 5, “Training method”
  9. Templates W0-W3 are basic word-level unigram features, where Length(w0) denotes the length of the word w0.
    Page 5, “Training method”
  10. Templates B0-B9 are basic word-level bigram features.
    Page 5, “Training method”
  11. Since we were interested in finding an optimal combination of word-level and character-level nodes for training, we focused on tuning r.
    Page 6, “Experiments”
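
The excerpts above describe a lattice that mixes word-level nodes (dictionary entries with their regular POS tags) with character-level nodes (single characters carrying combined position-of-character and POS tags), so that known and unknown words are handled in one search space. The Python sketch below illustrates one way such a lattice could be enumerated; the Node structure, the function name build_lattice, and the tag inventories are illustrative assumptions rather than the paper's implementation.

    from dataclasses import dataclass

    @dataclass
    class Node:
        start: int       # character offset where the node begins
        end: int         # character offset where it ends (exclusive)
        surface: str     # covered substring
        tag: str         # POS tag for word-level nodes, POC-POS tag for character-level nodes
        is_word: bool    # True for word-level (dictionary) nodes

    def build_lattice(sentence, word_dict, pos_tags, poc_tags=("B", "I", "E", "S")):
        """Enumerate word-level nodes for dictionary matches and character-level
        nodes (one per character, with combined POC-POS tags) for unknown words."""
        nodes = []
        n = len(sentence)
        for i in range(n):
            # word-level nodes: every dictionary entry that starts at position i
            for j in range(i + 1, n + 1):
                w = sentence[i:j]
                if w in word_dict:
                    for pos in word_dict[w]:          # known words keep their regular POS tags
                        nodes.append(Node(i, j, w, pos, True))
            # character-level nodes: the single character with each POC-POS combination
            for poc in poc_tags:
                for pos in pos_tags:
                    nodes.append(Node(i, i + 1, sentence[i], poc + "-" + pos, False))
        return nodes

    # e.g. build_lattice("北京大学", {"北京": ["NR"], "大学": ["NN"]}, ["NR", "NN", "VV"])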

bigrams

Appears in 9 sentences as: Bigram (2) bigram (3) bigrams (5)
  1. We broadly classify features into two categories: unigram and bigram features.
    Page 5, “Training method”
  2. TB0 <TE(w-1)>, TB1 <TE(w-1), p0>, TB2 <TE(w-1), p-1, p0>, TB3 <TE(w-1), TB(w0)>, TB4 <TE(w-1), TB(w0), p0>, TB5 <TE(w-1), p-1, TB(w0)>, TB6 <TE(w-1), p-1, TB(w0), p0>, CB0 <p-1, p0> otherwise. Table 3: Bigram features.
    Page 5, “Training method”
  3. Bigram features: Table 3 shows our bigram features.
    Page 5, “Training method”
  4. Templates B0-B9 are basic word-level bigram features.
    Page 5, “Training method”
  5. These features aim to capture all the possible combinations of word and POS bigrams.
    Page 5, “Training method”
  6. Templates TB0-TB6 are the types of characters for bigrams.
    Page 5, “Training method”
  7. Note that if one of the adjacent nodes is a character-level node, we use the template CB0 that represents POS bigrams.
    Page 5, “Training method”
  8. In our preliminary experiments, we found that if we add more features to non-word-level bigrams, the number of features grows rapidly due to the dense connections between non-word-level nodes.
    Page 5, “Training method”
  9. However, these features only slightly improve performance over using simple POS bigrams.
    Page 5, “Training method”
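
The excerpts describe bigram templates that combine the words, POS tags, and character types of two adjacent lattice nodes, and a deliberately small template (CB0, POS bigrams only) whenever a character-level node is involved, to keep the feature space from exploding. Below is a rough Python illustration of instantiating such templates; the helper char_type, the node representation, and the particular subset of templates shown are assumptions rather than the paper's feature extractor.

    def char_type(c):
        # Coarse character-type classes; a stand-in for the paper's character-type features.
        if c.isdigit():
            return "DIGIT"
        if c.isalpha() and c.isascii():
            return "LATIN"
        return "HAN"    # everything else is treated as a Chinese character in this sketch

    def bigram_features(left, right):
        """Instantiate bigram templates for two adjacent nodes, each a dict with
        'surface', 'pos', and 'is_word' keys."""
        feats = []
        if left["is_word"] and right["is_word"]:
            # word-level bigram templates (a few of the B0-B9 family)
            feats.append(f"B:w-1={left['surface']}|w0={right['surface']}")
            feats.append(f"B:p-1={left['pos']}|p0={right['pos']}")
            feats.append(f"B:w-1={left['surface']}|p0={right['pos']}")
            # character-type templates in the spirit of TB0-TB6:
            # TE(w-1) = type of the last character of w-1, TB(w0) = type of the first character of w0
            te, tb = char_type(left["surface"][-1]), char_type(right["surface"][0])
            feats.append(f"TB:TE(w-1)={te}")
            feats.append(f"TB:TE(w-1)={te}|TB(w0)={tb}|p0={right['pos']}")
        else:
            # CB0: only the POS bigram once a character-level node is involved
            feats.append(f"CB0:p-1={left['pos']}|p0={right['pos']}")
        return feats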

cross validation

Appears in 8 sentences as: cross validation (9)
  1. Divide the training corpus into ten equal sets and perform 10-fold cross validation to find the errors.
    Page 3, “Policies for correct path selection”
  2. After ten cross validation runs, we get a list of the unidentified unknown words derived from the whole training corpus.
    Page 3, “Policies for correct path selection”
  3. Note that the unidentified unknown words in the cross validation are not necessarily infrequent words, but some overlap may exist.
    Page 3, “Policies for correct path selection”
  4. Finally, we obtain the artificial unknown words that combine the unidentified unknown words in cross validation and infrequent words for learning unknown words.
    Page 3, “Policies for correct path selection”
  5. For the error-driven policy, we collected unidentified unknown words using 10-fold cross validation on the training set, as previously described in Section 3.
    Page 6, “Experiments”
  6. Table 9: Comparison of averaged F1 results (by 10-fold cross validation) with previous studies on CTB 3.0.
    Page 7, “Experiments”
  7. Unfortunately, Zhang and Clark’s experimental setting did not allow us to use our error-driven policy since performing 10-fold cross validation again on each main cross validation trial is computationally too expensive.
    Page 8, “Experiments”
  8. Therefore, we used our baseline policy in this setting and fixed r = 3 for all cross validation runs.
    Page 8, “Experiments”
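
Taken together, the excerpts outline the error-driven policy: run 10-fold cross validation over the training corpus, collect the unknown words the model fails to identify, and merge them with infrequent words to obtain the artificial unknown words used for training. The following is a minimal Python sketch of that collection step; train, decode, the (word, POS) entry representation, and the strictness of the frequency threshold r are all assumptions standing in for the paper's actual machinery.

    from collections import Counter

    def artificial_unknown_words(corpus, train, decode, n_folds=10, r=3):
        """corpus: list of sentences, each a list of (word, pos) entries.
        train/decode: placeholders for the real learner and its joint decoder.
        r: frequency threshold; entries seen fewer than r times count as infrequent (assumed)."""
        folds = [corpus[i::n_folds] for i in range(n_folds)]
        unidentified = set()
        for i in range(n_folds):
            held_out = folds[i]
            train_part = [s for j, fold in enumerate(folds) if j != i for s in fold]
            model = train(train_part)
            for sent in held_out:
                gold = set(sent)                                   # gold (word, pos) entries
                pred = set(decode(model, [w for w, _ in sent]))    # predicted (word, pos) entries
                unidentified |= gold - pred                        # entries the model failed to identify
        counts = Counter(entry for sent in corpus for entry in sent)
        infrequent = {e for e, c in counts.items() if c < r}
        return unidentified | infrequent    # error-driven artificial unknown words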

learning algorithm

Appears in 8 sentences as: Learning Algorithm (1) learning algorithm (6) learning algorithms (1)
  1. The goal of our learning algorithm is to learn a mapping from inputs (unsegmented sentences) x ∈ X to outputs (segmented paths) y ∈ Y.
    Page 2, “Background”
  2. Therefore, we require a learning algorithm that can efficiently handle large and complex lattice structures.
    Page 3, “Training method”
  3. Algorithm 1 Generic Online Learning Algorithm
    Page 4, “Training method”
  4. Algorithm 1 outlines the generic online learning algorithm (McDonald, 2006) used in our framework.
    Page 4, “Training method”
  5. We focus on an online learning algorithm called MIRA (Crammer, 2004), which has the desired accuracy and scalability properties.
    Page 4, “Training method”
  6. In this section, we discuss related approaches based on several aspects of learning algorithms and search space representation methods.
    Page 8, “Related work”
  7. Our approach overcomes the limitation of the original hybrid model by a discriminative online learning algorithm for training.
    Page 8, “Related work”
  8. The second is a discriminative online learning algorithm based on MIRA that enables us to incorporate arbitrary features to our hybrid model.
    Page 8, “Conclusion”
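
The excerpts point to a generic online learning loop (Algorithm 1, after McDonald, 2006): repeatedly decode each training sentence with the current weights, update the weights, and average the parameters at the end. The skeleton below is a plain Python rendering of that loop under common conventions (k-best decoding, parameter averaging); decode_kbest and update are assumed callbacks, and the details of the paper's Algorithm 1 may differ.

    import numpy as np

    def online_learn(examples, decode_kbest, update, dim, n_iters=10, k=5):
        """examples: list of (x, y_correct) pairs.
        decode_kbest(w, x, k): returns the k best candidate paths under weights w (assumed helper).
        update(w, x, y_correct, candidates): returns new weights, e.g. a MIRA step (assumed helper)."""
        w = np.zeros(dim)
        w_sum = np.zeros(dim)          # accumulator for parameter averaging
        steps = 0
        for _ in range(n_iters):
            for x, y_correct in examples:
                candidates = decode_kbest(w, x, k)
                w = update(w, x, y_correct, candidates)
                w_sum += w
                steps += 1
        return w_sum / steps           # averaged weight vector used as the final model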

Seg

Appears in 8 sentences as: Seg (13) |Seg (1)
  1. We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag).
    Page 6, “Experiments”
  2. For Seg, a token is considered to be a correct one if the word boundary is correctly identified.
    Page 6, “Experiments”
  3. For Seg & Tag, both the word boundary and its POS tag have to be correctly identified to be counted as a correct token.
    Page 6, “Experiments”
  4. Method | Seg | Seg & Tag
    Page 7, “Experiments”
  5. Per-trial F1 results (Seg and Seg & Tag; "base" denotes the baseline policy):
     Trial | Seg: N&U07 | Seg: Z&C08 | Seg: Ours (base) | Seg & Tag: N&U07 | Seg & Tag: Z&C08 | Seg & Tag: Ours (base)
     1     | 0.9701     | 0.9721     | 0.9732           | 0.9262           | 0.9346           | 0.9358
     2     | 0.9738     | 0.9762     | 0.9752           | 0.9318           | 0.9385           | 0.9380
     3     | 0.9571     | 0.9594     | 0.9578           | 0.9023           | 0.9086           | 0.9067
     4     | 0.9629     | 0.9592     | 0.9655           | 0.9132           | 0.9160           | 0.9223
     5     | 0.9597     | 0.9606     | 0.9617           | 0.9132           | 0.9172           | 0.9187
     6     | 0.9473     | 0.9456     | 0.9460           | 0.8823           | 0.8883           | 0.8885
     7     | 0.9528     | 0.9500     | 0.9562           | 0.9003           | 0.9051           | 0.9076
     8     | 0.9519     | 0.9512     | 0.9528           | 0.9002           | 0.9030           | 0.9062
     9     | 0.9566     | 0.9479     | 0.9575           | 0.8996           | 0.9033           | 0.9052
     10    | 0.9631     | 0.9645     | 0.9659           | 0.9154           | 0.9196           | 0.9225
     Avg.
    Page 7, “Experiments”
  6. Method          | Seg    | Seg & Tag
     Ours (baseline) | 0.9611 | 0.9152
     Z&C08           | 0.9590 | 0.9134
     N&U07           | 0.9595 | 0.9085
     N&L04           | 0.9520 | -
    Page 7, “Experiments”
  7. The result of our error-driven model is superior to previously reported results for both Seg and Seg & Tag, and the result of our baseline model compares favorably to the others.
    Page 7, “Experiments”
  8. Our baseline model outperforms all prior approaches for both Seg and Seg & Tag, and we hope that our error-driven model can further improve performance.
    Page 8, “Experiments”
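
The excerpts spell out the scoring convention: for Seg a token counts as correct when its word boundary matches the gold standard, and for Seg & Tag the POS tag must match as well, with F1 computed over these tokens. A small, self-contained Python sketch of that scoring follows; it assumes gold and predicted sentences are given as lists of (word, pos) pairs.

    def tokens(segmentation, with_pos):
        """Convert a list of (word, pos) pairs into a set of tokens keyed by character offsets."""
        out, start = set(), 0
        for word, pos in segmentation:
            end = start + len(word)
            out.add((start, end, pos) if with_pos else (start, end))
            start = end
        return out

    def f1_score(gold_sents, pred_sents, with_pos=False):
        """with_pos=False gives the Seg score; with_pos=True gives the Seg & Tag score."""
        tp = n_gold = n_pred = 0
        for gold, pred in zip(gold_sents, pred_sents):
            g, p = tokens(gold, with_pos), tokens(pred, with_pos)
            tp += len(g & p)
            n_gold += len(g)
            n_pred += len(p)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0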

loss function

Appears in 7 sentences as: Loss function (1) loss function (6)
  1. We describe k-best decoding for our hybrid model and design its loss function and the features appropriate for our task.
    Page 1, “Introduction”
  2. Given a training example (xt, yt), MIRA tries to establish a margin between the score of the correct path s(xt, yt; w) and the score of the best candidate path s(xt, ŷ; w) based on the current weight vector w that is proportional to a loss function L(yt, ŷ).
    Page 4, “Training method”
  3. 4.3 Loss function
    Page 4, “Training method”
  4. We instead compute the loss function through false positives (FP) and false negatives (FN).
    Page 4, “Training method”
  5. We define the loss function by:
    Page 4, “Training method”
  6. This loss function can reflect how bad the predicted path ŷ is compared to the correct path yt.
    Page 4, “Training method”
  7. A weighted loss function based on FP and FN can be found in (Ganchev et al., 2007).
    Page 4, “Training method”
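
The excerpts say the loss is computed through false positives (FP) and false negatives (FN) between the predicted path ŷ and the correct path yt. A natural instantiation, and the one assumed in this sketch, is L(yt, ŷ) = FP + FN over the (boundary, POS) tokens of the two paths; the paper's exact definition may weight the terms differently.

    def fp_fn_loss(correct_path, predicted_path):
        """Each path is a list of (word, pos) pairs covering the same sentence."""
        def to_tokens(path):
            toks, start = set(), 0
            for word, pos in path:
                toks.add((start, start + len(word), pos))
                start += len(word)
            return toks

        gold = to_tokens(correct_path)
        pred = to_tokens(predicted_path)
        fp = len(pred - gold)    # tokens proposed by the model but not in the correct path
        fn = len(gold - pred)    # tokens in the correct path that the model missed
        return fp + fn           # zero exactly when the predicted path matches the correct path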

weight vector

Appears in 7 sentences as: weight vector (8) weight vectors (1)
  1. Input: Training set S = {(xt, yt)}, t = 1, ..., T; Output: Model weight vector w
    Page 4, “Training method”
  2. where w is a weight vector and f is a feature representation of an input x and an output y.
    Page 4, “Training method”
  3. Learning a mapping between an input-output pair corresponds to finding a weight vector w such that the best scoring path of a given sentence is the same as (or close to) the correct path.
    Page 4, “Training method”
  4. Given a training example (xt, yt), MIRA tries to establish a margin between the score of the correct path s(xt, yt; w) and the score of the best candidate path s(xt, ŷ; w) based on the current weight vector w that is proportional to a loss function L(yt, ŷ).
    Page 4, “Training method”
  5. In each iteration, MIRA updates the weight vector w by keeping the norm of the change in the weight vector as small as possible.
    Page 4, “Training method”
  6. where bestk(xt; w(i)) ⊆ Y represents a set of top k-best paths given the weight vector w(i).
    Page 4, “Training method”
  7. The final weight vector w is the average of the weight vectors after each iteration.
    Page 4, “Training method”
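
The excerpts describe the MIRA update: enforce a margin between the score of the correct path and each candidate that is proportional to the loss, while keeping the change to the weight vector as small as possible, and average the weights over iterations. The sketch below shows the simplified single-constraint (1-best) version of that update; the paper uses a k-best variant that imposes one such constraint per candidate, and the clip constant C is an assumption.

    import numpy as np

    def mira_update(w, f_correct, f_pred, loss, C=1.0):
        """One single-constraint MIRA step.
        f_correct, f_pred: feature vectors of the correct and the predicted path.
        loss: L(yt, y_hat) for the predicted path."""
        delta = f_correct - f_pred
        margin = np.dot(w, delta)         # current score difference s(correct) - s(predicted)
        norm_sq = np.dot(delta, delta)
        if norm_sq == 0.0:
            return w                      # identical feature vectors: nothing to update
        # smallest step that raises the margin to at least the loss, clipped by C
        tau = min(C, max(0.0, (loss - margin) / norm_sq))
        return w + tau * delta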

Chinese word

Appears in 4 sentences as: Chinese word (4)
  1. In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 1, “Abstract”
  2. Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
    Page 6, “Experiments”
  3. For example, a perceptron algorithm is used for joint Chinese word segmentation and POS tagging (Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
    Page 8, “Related work”
  4. In this paper, we presented a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 8, “Conclusion”

Chinese word segmentation

Appears in 4 sentences as: Chinese word segmentation (4)
  1. In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 1, “Abstract”
  2. Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
    Page 6, “Experiments”
  3. For example, a perceptron algorithm is used for joint Chinese word segmentation and POS tagging (Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
    Page 8, “Related work”
  4. In this paper, we presented a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
    Page 8, “Conclusion”

development set

Appears in 4 sentences as: (1) development set (3)
  1. In our experiments, the optimal threshold value r is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
    Page 3, “Policies for correct path selection”
  2. Note that the development set was only used for evaluating the trained model to obtain the optimal values of tunable parameters.
    Page 6, “Experiments”
  3. For the baseline policy, we varied r in the range of [1, 5] and found that setting r = 3 yielded the best performance on the development set for both the small and large training corpus experiments.
    Page 6, “Experiments”
  4. Optimal balances were selected using the development set.
    Page 6, “Experiments”
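
The excerpts state that tunable parameters, in particular the threshold r, were chosen by evaluating joint word segmentation and POS tagging on the development set, with r varied over [1, 5]. A trivial grid-search sketch of that selection is given below; train_with_threshold and seg_tag_f1 are hypothetical placeholders for the paper's training and evaluation procedures.

    def select_threshold(train_set, dev_set, train_with_threshold, seg_tag_f1, candidates=range(1, 6)):
        """Return the threshold r in [1, 5] whose trained model scores best on the development set."""
        best_r, best_score = None, -1.0
        for r in candidates:
            model = train_with_threshold(train_set, r)
            score = seg_tag_f1(model, dev_set)    # joint Seg & Tag F1 on the development set
            if score > best_score:
                best_r, best_score = r, score
        return best_r, best_score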

unigram

Appears in 4 sentences as: Unigram (2) unigram (3)
  1. Table 2: Unigram features.
    Page 5, “Training method”
  2. We broadly classify features into two categories: unigram and bigram features.
    Page 5, “Training method”
  3. Unigram features: Table 2 shows our unigram features.
    Page 5, “Training method”
  4. Templates W0-W3 are basic word-level unigram features, where Length(w0) denotes the length of the word w0.
    Page 5, “Training method”
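
The excerpts list W0 <w0> and W1 <p0> among the word-level unigram templates and note that W0-W3 include a Length(w0) feature. The toy Python function below shows how such templates could be instantiated for a single word-level node; the exact contents of W2 and W3 are assumptions consistent with, but not quoted from, the paper.

    def unigram_features(node):
        """node: dict with 'surface' (the word w0) and 'pos' (its POS tag p0)."""
        w0, p0 = node["surface"], node["pos"]
        return [
            f"W0:w0={w0}",                  # the word itself
            f"W1:p0={p0}",                  # its POS tag
            f"W2:w0={w0}|p0={p0}",          # word and POS tag together (assumed)
            f"W3:len={len(w0)}|p0={p0}",    # Length(w0) combined with the POS tag (assumed)
        ]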

Treebank

Appears in 3 sentences as: Treebank (3)
  1. We describe an efficient framework for training our model based on the Margin Infused Relaxed Algorithm (MIRA), evaluate our approach on the Penn Chinese Treebank, and show that it achieves superior performance compared to the state-of-the-art approaches reported in the literature.
    Page 1, “Abstract”
  2. We conducted our experiments on Penn Chinese Treebank (Xia et al., 2000) and compared our approach with the best previous approaches reported in the literature.
    Page 1, “Introduction”
  3. Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
    Page 6, “Experiments”
