Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
Longkai Zhang, Li Li, Zhengyan He, Houfeng Wang, and Ni Sun

Article Structure

Abstract

The micro-blog is a new kind of medium whose texts are short and informal.

INTRODUCTION

Micro-blogs (known as tweets in English) are a new kind of broadcast medium in the form of blogging.

Our method

2.1 Punctuations

Experiment

3.1 Data set

Related Work

Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).

Conclusion

In this paper we have presented an effective yet simple approach to Chinese word segmentation on micro-blog texts.

Topics

word segmentation

Appears in 13 sentences as: Word Segmentation (2) word segmentation (11) word segmenter (2)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well as they do on ordinary news texts.
    Page 1, “Abstract”
  2. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog texts.
    Page 1, “Abstract”
  3. These new features of micro-blogs make Chinese Word Segmentation (CWS) models trained on a source domain, such as a news corpus, fail to perform equally well when transferred to micro-blog texts.
    Page 1, “INTRODUCTION”
  4. The Chinese word segmentation problem can be treated as a character labeling problem in which each character receives a label indicating its position within a word.
    Page 1, “Our method” (a sketch of this labeling follows this list)
  5. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data.
    Page 2, “Experiment”
  6. The first two are both well-known Chinese word segmentation tools, ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks involving word segmentation.
    Page 2, “Experiment”
  7. The Stanford Chinese word segmenter is a CRF-based segmentation tool whose segmentation standard is the PKU standard, the same as ours.
    Page 2, “Experiment”
  8. ICTCLAS, on the other hand, is an HMM-based Chinese word segmenter.
    Page 2, “Experiment”
  9. Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
    Page 4, “Related Work”
  10. On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
    Page 4, “Related Work”
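
A short sketch can make the character labeling formulation in item 4 concrete. The example below assumes the common 4-tag set {B, M, E, S}; these excerpts do not specify the paper's exact tag set, so treat the labels as illustrative.

```python
# Minimal sketch: map a segmented sentence to per-character position tags.
# The tag set {B, M, E, S} is one common convention, assumed here.

def words_to_char_labels(words):
    """Return parallel lists of characters and position tags."""
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            if len(word) == 1:
                labels.append("S")          # single-character word
            elif i == 0:
                labels.append("B")          # word-initial character
            elif i == len(word) - 1:
                labels.append("E")          # word-final character
            else:
                labels.append("M")          # word-internal character
    return chars, labels

# words_to_char_labels(["我", "喜欢", "北京"])
# -> (['我', '喜', '欢', '北', '京'], ['S', 'B', 'E', 'B', 'E'])
```

A segmentation is then recovered from predicted tags by cutting after every E or S.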

Chinese word

Appears in 12 sentences as: Chinese Word (2) Chinese word (12)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well as they do on ordinary news texts.
    Page 1, “Abstract”
  2. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog texts.
    Page 1, “Abstract”
  3. These new features of micro-blogs make Chinese Word Segmentation (CWS) models trained on a source domain, such as a news corpus, fail to perform equally well when transferred to micro-blog texts.
    Page 1, “INTRODUCTION”
  4. The Chinese word segmentation problem can be treated as a character labeling problem in which each character receives a label indicating its position within a word.
    Page 1, “Our method”
  5. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data.
    Page 2, “Experiment”
  6. The first two are both well-known Chinese word segmentation tools, ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks involving word segmentation.
    Page 2, “Experiment”
  7. The Stanford Chinese word segmenter is a CRF-based segmentation tool whose segmentation standard is the PKU standard, the same as ours.
    Page 2, “Experiment”
  8. ICTCLAS, on the other hand, is an HMM-based Chinese word segmenter.
    Page 2, “Experiment”
  9. Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
    Page 4, “Related Work”
  10. On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
    Page 4, “Related Work”
  11. In addition, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics are used as discrete features, which proves effective in Chinese word segmentation.
    Page 4, “Related Work”

Chinese word segmentation

Appears in 12 sentences as: Chinese Word Segmentation (2) Chinese word segmentation (9) Chinese word segmenter (2)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well as they do on ordinary news texts.
    Page 1, “Abstract”
  2. In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog texts.
    Page 1, “Abstract”
  3. These new features of micro-blogs make Chinese Word Segmentation (CWS) models trained on a source domain, such as a news corpus, fail to perform equally well when transferred to micro-blog texts.
    Page 1, “INTRODUCTION”
  4. The Chinese word segmentation problem can be treated as a character labeling problem in which each character receives a label indicating its position within a word.
    Page 1, “Our method”
  5. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data.
    Page 2, “Experiment”
  6. The first two are both well-known Chinese word segmentation tools, ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks involving word segmentation.
    Page 2, “Experiment”
  7. The Stanford Chinese word segmenter is a CRF-based segmentation tool whose segmentation standard is the PKU standard, the same as ours.
    Page 2, “Experiment”
  8. ICTCLAS, on the other hand, is an HMM-based Chinese word segmenter.
    Page 2, “Experiment”
  9. Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
    Page 4, “Related Work”
  10. On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
    Page 4, “Related Work”
  11. In addition, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics are used as discrete features, which proves effective in Chinese word segmentation.
    Page 4, “Related Work”

f-score

Appears in 6 sentences as: F-score (1) f-score (6)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. For example, the most widely used Chinese segmenter, ICTCLAS, yields a 0.95 f-score on news corpora but only a 0.82 f-score on micro-blog data.
    Page 1, “INTRODUCTION” (a sketch of how segmentation f-score is computed follows this list)
  2. F-score
    Page 2, “Experiment”
  3. Both the f-score and OOV-recall increase.
    Page 3, “Experiment”
  4. By comparing No-balance and ADD-N alone, we find that ignoring the tag balance issue yields a relatively high f-score while slightly hurting the OOV-Recall.
    Page 3, “Experiment”
  5. However, considering the tag balance issue improves the OOV-Recall by about +1.6% and the f-score by +0.2%.
    Page 3, “Experiment”
  6. We note that as the number of texts increases from 0 to 50,000, both the f-score and OOV-recall improve.
    Page 3, “Experiment”
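
For reference, segmentation f-score is conventionally computed by matching predicted words against gold words as character spans; OOV-Recall is the same recall restricted to words unseen in the training data. The sketch below mirrors standard Bakeoff-style scoring and is an assumption, not the authors' exact evaluation script.

```python
# Minimal sketch of word-level P/R/F for segmentation, via span matching.

def to_spans(words):
    """Turn a segmented sentence into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def prf(gold_words, pred_words):
    """Precision, recall, and f-score over correctly segmented words."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# prf(["我", "喜欢", "北京"], ["我", "喜", "欢", "北京"])
# -> precision 0.5, recall ~0.667, f-score ~0.571
```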

unlabeled data

Appears in 5 sentences as: unlabeled data (6)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data, which contain 2,420,037 characters.
    Page 2, “Experiment”
  2. We also experimented with different sizes of unlabeled data to evaluate the performance of adding unlabeled target-domain data.
    Page 3, “Experiment”
  3. Table 5 shows the f-scores and OOV-Recalls on different unlabeled data set sizes.
    Page 3, “Experiment”
  4. However, when the unlabeled data grows to 200,000 texts, the performance decreases slightly, though it is still better than using no unlabeled data.
    Page 3, “Experiment”
  5. Table 5: Segmentation performance with different sizes of unlabeled data
    Page 4, “Experiment”

Maxent

Appears in 4 sentences as: Maxent (4)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. Method      P      R      F      OOV-R
     Stanford    0.861  0.853  0.857  0.639
     ICTCLAS     0.812  0.861  0.836  0.602
     Li-Sun      0.707  0.820  0.760  0.734
     Maxent      0.868  0.844  0.856  0.760
     No-punc     0.865  0.829  0.846  0.760
     No-balance  0.869  0.877  0.873  0.757
     Our method  0.875  0.875  0.875  0.773
    Page 3, “Experiment”
  2. Maxent uses only the PKU data for training, incorporating neither punctuation information nor the self-training framework.
    Page 3, “Experiment”
  3. The comparison of Maxent and No-punctuation
    Page 3, “Experiment”
  4. The comparison of Maxent, No-balance and ADD-N shows that considering punctuation as well as self-training does improve performance.
    Page 3, “Experiment” (a sketch of the punctuation cue follows this list)
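
The punctuation signal that distinguishes No-punc from the full method can be sketched simply: a punctuation mark is a reliable word boundary, so the character before it must end a word and the character after it must begin one, yielding free partial labels from raw micro-blog text. The PUNCT list and label names below are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: partial character labels implied by punctuation in raw text.

PUNCT = set("，。！？；：、“”‘’（）…")

def punctuation_labels(text):
    """Return {index: partial label} for characters adjacent to punctuation."""
    labels = {}
    for i, ch in enumerate(text):
        if ch in PUNCT:
            continue
        after_punct = i > 0 and text[i - 1] in PUNCT
        before_punct = i + 1 < len(text) and text[i + 1] in PUNCT
        if after_punct and before_punct:
            labels[i] = "S"       # a single-character word
        elif after_punct:
            labels[i] = "B-or-S"  # must begin a word
        elif before_punct:
            labels[i] = "E-or-S"  # must end a word
    return labels

# punctuation_labels("你好，世界。")
# -> {1: 'E-or-S', 3: 'B-or-S', 4: 'E-or-S'}
```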

semi-supervised

Appears in 4 sentences as: semi-supervised (4)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. We build a semi-supervised learning (SSL) framework which can iteratively incorporate newly labeled instances from unlabeled micro-blog data during the training process.
    Page 1, “INTRODUCTION” (a sketch of this self-training loop follows this list)
  2. Another baseline is Li and Sun (2009), who also use punctuation in their semi-supervised framework.
    Page 2, “Experiment”
  3. Meanwhile, semi-supervised methods have been applied to NLP applications.
    Page 4, “Related Work”
  4. Similar semi-supervised applications include Shen et al.
    Page 4, “Related Work”
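
In outline, the SSL framework in item 1 is a self-training loop. The sketch below is generic: train and label_confidently are hypothetical stand-ins for the paper's maximum entropy trainer and its confidence-based selection of newly labeled instances.

```python
# Generic self-training sketch, assuming hypothetical `train` and
# `label_confidently` callables supplied by the caller.

def self_train(train, label_confidently, labeled, unlabeled, rounds=5):
    """Retrain as confidently labeled instances are folded in."""
    model = train(labeled)                    # initial source-domain model
    for _ in range(rounds):
        new = label_confidently(model, unlabeled)
        if not new:
            break                             # nothing confident left to add
        labeled = labeled + new               # grow the training set
        model = train(labeled)                # retrain on the enlarged set
    return model
```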

labeled data

Appears in 3 sentences as: labeled data (3)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. We randomly reuse some characters labeled ’N’ from the labeled data until the ratio η is reached.
    Page 2, “Our method” (a sketch of this rebalancing step follows this list)
  2. In summary, our algorithm tackles the problem by duplicating labeled data from the source domain.
    Page 2, “Our method”
  3. We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data.
    Page 2, “Experiment”
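
Item 1's rebalancing can be read as plain oversampling: ’N’-labeled instances are randomly re-drawn from the labeled data until their proportion reaches the ratio η. The sketch below assumes that reading; the precise definitions of the ’N’ tag and of η are in the paper, not these excerpts.

```python
import random

def rebalance(instances, labels, tag="N", eta=0.5, rng=random):
    """Duplicate `tag`-labeled instances until they make up ratio eta.

    Assumes 0 < eta < 1 so the loop terminates."""
    pairs = list(zip(instances, labels))
    pool = [p for p in pairs if p[1] == tag]   # 'N'-labeled candidates
    out = list(pairs)
    n_tag = len(pool)
    while pool and n_tag / len(out) < eta:
        out.append(rng.choice(pool))           # randomly reuse one instance
        n_tag += 1
    return out
```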

models trained

Appears in 3 sentences as: model trained (1) models trained (2)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. These new features of micro-blogs make Chinese Word Segmentation (CWS) models trained on a source domain, such as a news corpus, fail to perform equally well when transferred to micro-blog texts.
    Page 1, “INTRODUCTION”
  2. Because of this, the model trained on this unbalanced corpus tends to be biased.
    Page 2, “Our method”
  3. When segmenting target-domain texts using models trained on the source domain, performance is hurt as more falsely segmented instances are added into the training set.
    Page 3, “Experiment”

sequence labeling

Appears in 3 sentences as: sequence labeling (4)
In Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
  1. Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
    Page 4, “Related Work” (a sketch of typical character features follows this list)
  2. In addition, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics are used as discrete features, which proves effective in Chinese word segmentation.
    Page 4, “Related Work”
  3. Sun and Xu (2011) use punctuation information as a discrete feature in a sequence labeling framework, which shows improvement over the pure sequence labeling approach.
    Page 4, “Related Work”
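
For concreteness, here is a sketch of the character features typically used when CWS is cast as sequence labeling: unigrams and bigrams over a five-character window. These templates are the common ones in the literature cited above, not necessarily this paper's exact feature set.

```python
# Sketch of common character feature templates for a CWS sequence labeler.

def char_features(chars, i):
    """Feature strings for position i over a +/-2 character window."""
    def c(k):
        # Character at offset k from i, or a boundary marker.
        j = i + k
        return chars[j] if 0 <= j < len(chars) else "<PAD>"
    feats = [f"C{k}={c(k)}" for k in range(-2, 3)]                 # unigrams
    feats += [f"C{k}C{k+1}={c(k)}{c(k+1)}" for k in range(-2, 2)]  # bigrams
    return feats

# char_features(list("我喜欢北京"), 2) yields features around '欢'.
```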
