Abstract | We present a joint model for Chinese word segmentation and new word detection. |
Introduction | The major problem in Chinese word segmentation is ambiguity. |
Introduction | In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD). |
Introduction | We propose a joint model for Chinese word segmentation and new word detection. |
Related Work | Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010). |
System Architecture | This phenomenon will also undermine the performance of Chinese word segmentation. |
System Architecture | The B, I, E labels have been widely used in previous work on Chinese word segmentation (Sun et al., 2009b). |
System Architecture | We used benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff to test our proposals. |
Abstract | We show for both English POS tagging and Chinese word segmentation that, with proper representation, a large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference. |
Abstract | In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation. |
Abstract | For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging. |
About Heterogeneous Annotations | For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm. |
About Heterogeneous Annotations | Take Chinese word segmentation for example. |
Abstract | We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. |
Conclusion | Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging. |
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments. |
Introduction | This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks. |
Discussion and Future Work | Chinese word segmentation. |
Experiments | We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonym |
Experiments | In all our experiments here we use TESLA-CELAB with n-grams for n up to four, since the vast majority of Chinese words, and therefore synonyms, are at most four characters long. |
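
The last excerpt above limits n-grams to n up to four. As an illustration only, here is a minimal sketch of how such n-grams could be enumerated, assuming they are taken over characters (the excerpt does not say so explicitly, but its reasoning about word length in characters suggests it). The function name `char_ngrams` and its `max_n` parameter are hypothetical and are not part of TESLA-CELAB.

```python
def char_ngrams(sentence, max_n=4):
    """Return all character n-grams of `sentence` for n = 1..max_n.

    Assumption: n-grams are counted over characters and whitespace is
    ignored. This is an illustrative helper, not TESLA-CELAB code.
    """
    # Drop whitespace so that pre-segmented input gives the same n-grams.
    chars = [c for c in sentence if not c.isspace()]
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of width n across the character sequence.
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


if __name__ == "__main__":
    # A five-character string yields 5 + 4 + 3 + 2 n-grams for n = 1..4.
    print(len(char_ngrams("abcde")))
```

With `max_n=4`, any word of at most four characters appears as one of the n-grams, which is the property the excerpt relies on.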