Abstract | Because no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools do not perform as well on micro-blogs as on ordinary news texts. |
Abstract | In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs. |
Experiment | We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data. |
Experiment | The first two are both well-known Chinese word segmentation tools: ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP work involving word segmentation. |
Experiment | The Stanford Chinese word segmenter is a CRF-based segmentation tool, and its segmentation standard is set to the PKU standard, which is the same as ours. |
INTRODUCTION | These new features of micro-blogs cause Chinese Word Segmentation (CWS) models trained on a source domain, such as news corpora, to perform worse when transferred to micro-blog texts. |
Our method | The Chinese word segmentation problem can be treated as a character labeling problem, in which each character receives a label indicating its position within a word. |
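The character labeling formulation can be illustrated with a minimal sketch. It assumes the common four-tag scheme (B = begin, M = middle, E = end, S = single-character word); the excerpt does not say which tag set the authors use, and the function names `words_to_tags` and `tags_to_words` are hypothetical.

```python
def words_to_tags(words):
    """Convert a segmented sentence (list of words) into one tag per character.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # A word is complete once its last character (E) or a single (S) is seen.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


segmented = ["我", "喜欢", "自然语言"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under this scheme a segmenter only has to predict one tag per character, and the word boundaries follow from the tag sequence.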
Related Work | Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). |
Related Work | (1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems. |
Related Work | In addition, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics serve as discrete features, which prove to be effective in Chinese word segmentation. |
Abstract | With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese Wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features. |
Conclusion and Future Work | Experiments on Chinese word segmentation show that the enhanced word segmenter achieves significant improvement on testing sets of different domains, even though it uses a single classifier with only local features. |
Experiments | We use the Penn Chinese Treebank 5.0 (CTB) (Xue et al., 2005) as the existing annotated corpus for Chinese word segmentation. |
Experiments | Table 4: Comparison with state-of-the-art work in Chinese word segmentation. |
Experiments | Table 4 shows the comparison with other work in Chinese word segmentation. |
Introduction | Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets. |
Introduction | In the rest of the paper, we first briefly introduce the problems of Chinese word segmentation and the character classification model in section |
Related Work | Li and Sun (2009) extracted character classification instances from raw text for Chinese word segmentation, resorting to the indication of punctuation marks between characters. |
Related Work | Sun and Xu (2011) utilized features derived from large-scale unlabeled text to improve Chinese word segmentation. |
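The idea of reading punctuation as a boundary signal can be sketched as follows. This is a simplified illustration and not Li and Sun's actual procedure or feature set: it assumes only that a character directly after a punctuation mark (or at the start of the text) begins a word, and that a character directly before one (or at the end of the text) ends a word. The function names and label strings are hypothetical.

```python
import unicodedata


def is_punct(ch):
    """True if the character falls in a Unicode punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def punctuation_instances(text):
    """Collect (index, character, label) instances implied by punctuation.

    Labels (assumed names): "word-initial" for a character that follows
    punctuation or starts the text, "word-final" for a character that
    precedes punctuation or ends the text.
    """
    instances = []
    last = len(text) - 1
    for i, ch in enumerate(text):
        if is_punct(ch):
            continue
        if i == 0 or is_punct(text[i - 1]):
            instances.append((i, ch, "word-initial"))
        if i == last or is_punct(text[i + 1]):
            instances.append((i, ch, "word-final"))
    return instances


found = punctuation_instances("今天天气好，我们去公园。")
```

Instances collected this way need no manual annotation, which is why raw text can supply them in large quantities.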
Abstract | We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. |
Conclusion | There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR). |
Introduction | This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR), two tasks that should be solved jointly. |
Methodology | Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task). |
Abstract | This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging. |
Introduction | As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging. |
Method | This study introduces a novel semi-supervised approach for joint Chinese word segmentation and POS tagging. |
Related Work | Zhao (2009) studied character-level dependencies for Chinese word segmentation by formalizing the segmentation task in a dependency parsing framework. |
Related Work | They use it as a joint framework to perform Chinese word segmentation, POS tagging and syntactic parsing. |
Related Work | They exploit a generative maximum entropy model for character-based constituent parsing, and find that POS information is very useful for Chinese word segmentation , but high-level syntactic information seems to have little effect on segmentation. |