Abstract | No segmented corpus of micro-blogs is available to train a Chinese word segmentation model, and existing Chinese word segmentation tools do not perform as well on micro-blogs as on ordinary news texts. |
Abstract | In this paper we present a simple yet effective approach to Chinese word segmentation of micro-blogs. |
Experiment | We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data. |
Experiment | The first two are both well-known Chinese word segmentation tools, ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP work that involves word segmentation. |
Experiment | The Stanford Chinese word segmenter is a CRF-based segmentation tool, and its segmentation standard is set to the PKU standard, the same as ours. |
INTRODUCTION | These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. |
Our method | The Chinese word segmentation problem can be treated as a character labeling problem, in which each character receives a label indicating its position in a word. |
Related Work | Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). |
Related Work | (1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems. |
Related Work | Besides, Sun and Xu (2011) use a sequence labeling framework in which unsupervised statistics serve as discrete features, which prove to be effective in Chinese word segmentation. |
Abstract | In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN). |
Abstract | Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks. |
Conventional Neural Network | Formally, in the Chinese word segmentation task, we have a character dictionary D of size Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere. |
Conventional Neural Network | In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses four tags to carry word boundary information. |
Conventional Neural Network | (2013) modeled Chinese word segmentation as a series of |
Introduction | (2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance. |
Introduction | We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005), which are commonly used for evaluation of Chinese word segmentation. |
Introduction | We propose a Max-Margin Tensor Neural Network for Chinese word segmentation without feature engineering. |
Max-Margin Tensor Neural Network | In Chinese word segmentation, a proper modeling of the tag-tag interaction, tag-character interaction and character-character interaction is very important. |
Abstract | We present a joint model for Chinese word segmentation and new word detection. |
Introduction | The major problem of Chinese word segmentation is ambiguity. |
Introduction | In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD). |
Introduction | We propose a joint model for Chinese word segmentation and new word detection. |
Related Work | Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010). |
System Architecture | This phenomenon will also undermine the performance of Chinese word segmentation. |
System Architecture | The B, I, E labels have been widely used in previous work on Chinese word segmentation (Sun et al., 2009b). |
System Architecture | We used benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff to test our proposals. |
Abstract | With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese Wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features. |
Conclusion and Future Work | Experiments on Chinese word segmentation show that the enhanced word segmenter achieves significant improvement on testing sets from different domains, even though it uses a single classifier with only local features. |
Experiments | We use the Penn Chinese Treebank 5.0 (CTB) (Xue et al., 2005) as the existing annotated corpus for Chinese word segmentation. |
Experiments | Table 4: Comparison with state-of-the-art work in Chinese word segmentation. |
Experiments | Table 4 shows the comparison with other work in Chinese word segmentation. |
Introduction | Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets. |
Introduction | In the rest of the paper, we first briefly introduce the problems of Chinese word segmentation and the character classification model in section |
Related Work | Li and Sun (2009) extracted character classification instances from raw text for Chinese word segmentation, resorting to the indication of punctuation marks between characters. |
Related Work | Sun and Xu (2011) utilized features derived from large-scale unlabeled text to improve Chinese word segmentation. |
Abstract | We show for both English POS tagging and Chinese word segmentation that, with proper representation, a large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference. |
Abstract | In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation. |
Abstract | For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging. |
Abstract | The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters. |
Conclusion | A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-speech Tagging. |
Conclusion | Word Lattice Reranking for Chinese Word Segmentation |
Conclusion | An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging. |
Introduction | In recent years, the focus of research on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words toward characters. |
About Heterogeneous Annotations | For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm. |
About Heterogeneous Annotations | Take Chinese word segmentation for example. |
Abstract | We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. |
Conclusion | Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging. |
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments. |
Introduction | This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks. |
Abstract | Automatic extraction of new words is an indispensable precursor to many NLP tasks such as Chinese word segmentation, named entity extraction, and sentiment analysis. |
Experiment | The posts were then part-of-speech tagged using a Chinese word segmentation tool named ICTCLAS (Zhang et al., 2003). |
Introduction | Automatic extraction of new words is indispensable to many tasks such as Chinese word segmentation, machine translation, named entity extraction, question answering, and sentiment analysis. |
Introduction | New word detection is one of the most critical issues in Chinese word segmentation. |
Methodology | Obviously, in order to obtain the value of 3(wi), some particular Chinese word segmentation tool is required. |
Abstract | We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. |
Introduction | To test the efficacy of our method we choose Chinese word segmentation and part-of-speech tagging, where the problem of incompatible annotation standards is one of the most evident: so far no segmentation standard is widely accepted due to the lack of a clear definition of Chinese words, and the (almost complete) lack of morphology results in much bigger ambiguities and heavy debates in tagging philosophies for Chinese parts-of-speech. |
Segmentation and Tagging as Character Classification | Xue and Shen (2003) describe for the first time the character classification approach for Chinese word segmentation, where each character is given a boundary tag denoting its relative position in a word. |
Segmentation and Tagging as Character Classification | It is an online training algorithm and has been successfully used in many NLP tasks, such as POS tagging (Collins, 2002), parsing (Collins and Roark, 2004), Chinese word segmentation (Zhang and Clark, 2007; Jiang et al., 2008), and so on. |
Abstract | In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. |
Conclusion | In this paper, we presented a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. |
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments. |
Related work | For example, a perceptron algorithm is used for joint Chinese word segmentation and POS tagging (Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b). |
Abstract | We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly. |
Conclusion | There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR). |
Introduction | This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly. |
Methodology | Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task). |
Character-Level Dependency Tree | Zhao (2009) was the first to study character-level dependencies; they argue that since no consistent word boundaries exist over Chinese word segmentation, dependency-based representations of word structures serve as a good alternative for Chinese word segmentation. |
Character-Level Dependency Tree | (2012) proposed a joint model for Chinese word segmentation, POS-tagging and dependency parsing, studying the influence of the joint model and character features for parsing. Their model is extended from the arc-standard transition-based model, and can be regarded as an alternative to the arc-standard model of our work when pseudo intra-word dependencies are used. |
Introduction | First, character-level trees circumvent the issue that no universal standard exists for Chinese word segmentation. |
Introduction | In the well-known Chinese word segmentation bakeoff tasks, for example, different segmentation standards have been used by different data sets (Emerson, 2005). |
Abstract | This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging. |
Introduction | As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging. |
Method | This study introduces a novel semi-supervised approach for joint Chinese word segmentation and POS tagging. |
Related Work | Zhao (2009) studied character-level dependencies for Chinese word segmentation by formalizing the segmentation task in a dependency parsing framework. |
Related Work | They use it as a joint framework to perform Chinese word segmentation, POS tagging and syntax parsing. |
Related Work | They exploit a generative maximum entropy model for character-based constituent parsing, and find that POS information is very useful for Chinese word segmentation, but high-level syntactic information seems to have little effect on segmentation. |
Abstract | Both the Omni-word feature and the soft constraint make better use of sentence information and minimize the influences caused by Chinese word segmentation and parsing. |
Introduction | The lack of orthographic words makes Chinese word segmentation difficult. |
Related Work | (2008; 2010) also pointed out that, due to the inaccuracy of Chinese word segmentation and parsing, the tree kernel based approach is inappropriate for Chinese relation extraction. |
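
Several of the excerpts above formulate Chinese word segmentation as character labeling, with the BMES tag set carrying word boundary information. The sketch below illustrates that encoding only; it is not taken from any of the cited systems. It assumes the usual reading of the four tags (B = beginning, M = middle, E = end of a multi-character word, S = single-character word), and the function names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
def words_to_bmes(words):
    """Map a segmented sentence (a list of words) to one BMES tag per character.

    Assumed tag meanings: B = first character of a multi-character word,
    M = middle character, E = last character, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings so they produce no tags
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags.

    A word is closed whenever an E or S tag is seen; any trailing
    characters without a closing tag are emitted as a final word.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


example_words = ["我", "喜欢", "自然语言"]
example_tags = words_to_bmes(example_words)
restored = bmes_to_words("".join(example_words), example_tags)
```

With this encoding a segmenter only has to predict one tag per character, and the segmentation is recovered by cutting after every E or S tag. The B, I, E labels mentioned in one excerpt follow the same idea with a different tag inventory.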