Abstract | We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. |
Automatic Annotation Adaptation | Considering that word segmentation and Joint S&T can be conducted in the same character classification manner, we can design a unified standard adaptation framework for the two tasks by taking the source classifier's classification result as guide information for the target classifier's classification decision.
Introduction | Figure 1: Incompatible word segmentation and POS tagging standards between CTB (upper) and People's Daily (lower).
Introduction | To test the efficacy of our method we choose Chinese word segmentation and part-of-speech tagging, where the problem of incompatible annotation standards is one of the most evident: so far no segmentation standard is widely accepted due to the lack of a clear definition of Chinese words, and the (almost complete) lack of morphology results in much greater ambiguity and heated debate over tagging philosophies for Chinese parts-of-speech.
Introduction | In addition, the improved segmentation and tagging accuracies also lead to improved parsing accuracy on CTB, reducing the error propagated from word segmentation to parsing by 38%.
Segmentation and Tagging as Character Classification | Given a character sequence C_1 C_2 ... C_n, where C_i is a character, word segmentation aims to split the sequence into m (≤ n) words: C_{1:e_1} C_{e_1+1:e_2} ... C_{e_{m-1}+1:e_m}.
Segmentation and Tagging as Character Classification | Xue and Shen (2003) describe for the first time the character classification approach to Chinese word segmentation, where each character is given a boundary tag denoting its relative position in a word.
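The boundary-tag view above can be sketched concretely. The snippet below uses the common B/M/E/S tag set (begin, middle, end, single) as an illustration; the exact tag set varies across the papers cited, so treat this as a minimal sketch rather than any specific system's scheme.

```python
# Illustrative sketch of the character classification view of word
# segmentation: each character receives a boundary tag denoting its
# relative position in a word (B/M/E/S scheme assumed here).

def words_to_tags(words):
    """Map a segmented sentence (list of words) to per-character tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                  # single-character word
        else:
            tags.append("B")                  # word-initial character
            tags.extend("M" * (len(w) - 2))   # word-internal characters
            tags.append("E")                  # word-final character
    return tags

def tags_to_words(chars, tags):
    """Recover a segmentation from characters and their boundary tags."""
    words, current = [], ""
    for c, t in zip(chars, tags):
        current += c
        if t in ("S", "E"):                   # this tag closes a word
            words.append(current)
            current = ""
    if current:                                # tolerate a malformed tail
        words.append(current)
    return words

words = ["中国", "人民", "很", "伟大"]
tags = words_to_tags(words)
print(tags)  # ['B', 'E', 'B', 'E', 'S', 'B', 'E']
print(tags_to_words("".join(words), tags))
```

Because the two mappings are inverses, a character classifier that predicts tags immediately yields a segmenter.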
Segmentation and Tagging as Character Classification | In addition, Ng and Low (2004) find that, compared with POS tagging after word segmentation, Joint S&T can achieve higher accuracy on both segmentation and POS tagging.
Abstract | In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. |
Background | In the joint word segmentation and POS tagging process, the task is to predict a path
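The "predict a path" view can be made concrete with a toy decoder: each character receives a combined tag such as "B-NN" (boundary label plus POS), and decoding selects the highest-scoring valid tag sequence. The tag set and scorer below are hypothetical stand-ins, and the brute-force enumeration is only for illustration; real systems use learned models with Viterbi or beam search.

```python
from itertools import product

# Toy sketch of joint Seg&Tag as path prediction over combined
# boundary+POS tags (hypothetical tag set; B/E/S boundary labels).

TAGS = ["S-NN", "B-NN", "E-NN"]

def valid(path):
    """Check boundary-tag consistency: B must be followed by E, and
    E/S may not be followed by E; paths start with B/S and end with E/S."""
    for a, b in zip(path, path[1:]):
        if a.startswith("B") and not b.startswith("E"):
            return False
        if a[0] in "ES" and b.startswith("E"):
            return False
    return path[0][0] in "BS" and path[-1][0] in "ES"

def decode(chars, score):
    """Brute-force argmax over all valid tag paths (short inputs only)."""
    return max((p for p in product(TAGS, repeat=len(chars)) if valid(p)),
               key=score)

# A placeholder scorer that prefers fewer (hence longer) words by
# penalizing each word-initial tag.
score = lambda path: -sum(t[0] in "BS" for t in path)
print(decode("中国", score))  # ('B-NN', 'E-NN')
```

Replacing the enumeration with dynamic programming over the same path space gives the standard Viterbi decoder used in practice.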
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used the Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
Experiments | We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag). |
Experiments | (2008a; 2008b) on CTB 5.0 and Zhang and Clark (2008) on CTB 4.0 since they reported the best performances on joint word segmentation and POS tagging using the training materials only derived from the corpora. |
Introduction | In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks. |
Introduction | Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation. |
Introduction | Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over the pipelined approach (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
Policies for correct path selection | In our experiments, the optimal threshold value r is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
Related work | Maximum entropy models are widely used for word segmentation and POS tagging tasks (Uchimoto et al., 2001; Ng and Low, 2004; Nakagawa, 2004; Nakagawa and Uchimoto, 2007) since they require only moderate training time while providing reasonable performance.
Abstract | In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. |
Abstract | We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation.
Inference | To find the hidden word segmentation w of a string s = c_1 ... c_N, which is equivalent to the vector of binary hidden variables z = z_1 ... z_N, the simplest approach is to build a Gibbs sampler that randomly selects a character c_i, draws a binary decision z_i as to whether there is a word boundary, and then updates the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).
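The character-wise sampler described above can be sketched in a few lines. This is a toy illustration of the baseline procedure, not the paper's blocked sampler: z_i = 1 means a word boundary falls after character c_i, and `log_prob` is a placeholder scorer (here it just prefers two-character words) standing in for the learned language model.

```python
import math
import random

# Minimal sketch of a character-wise Gibbs sampler over binary boundary
# variables z (placeholder scorer; the final position is always a boundary).

def log_prob(words):
    return sum(-abs(len(w) - 2) for w in words)  # toy stand-in scorer

def segment(chars, z):
    """Turn boundary variables z back into a list of words."""
    words, start = [], 0
    for i, boundary in enumerate(z):
        if boundary:
            words.append(chars[start:i + 1])
            start = i + 1
    return words

def gibbs(chars, iters=500, seed=0):
    rng = random.Random(seed)
    # Random initial boundaries; the last position must be a boundary.
    z = [rng.random() < 0.5 for _ in range(len(chars) - 1)] + [True]
    for _ in range(iters):
        i = rng.randrange(len(chars) - 1)  # never resample the final boundary
        scores = {}
        for val in (False, True):
            z[i] = val
            scores[val] = log_prob(segment(chars, z))
        # Sample z_i from its conditional distribution given all other z.
        p_true = 1.0 / (1.0 + math.exp(scores[False] - scores[True]))
        z[i] = rng.random() < p_true
    return segment(chars, z)

print(gibbs("abcdefgh"))
```

Because each step resamples a single z_i, neighboring boundaries are highly correlated and mixing is slow, which is exactly the weakness that motivates the blocked Gibbs sampler with dynamic programming mentioned in the abstract.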
Introduction | Asian languages such as Chinese and Japanese have no explicit word boundaries, thus word segmentation is a crucial first step when processing them. |
Introduction | In order to extract "words" from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data can be arbitrary and may be suboptimal for applications that rely on the segmentation.
Introduction | This maximizes the probability of the word segmentation w given a string s: ŵ = argmax_w p(w | s).
Nested Pitman-Yor Language Model | If a lexicon is finite, we can use a uniform prior G_0(w) = 1/|V| for every word w in lexicon V. However, with word segmentation every substring could be a word, thus the lexicon is not limited but countably infinite.
Nested Pitman-Yor Language Model | Building an accurate G_0 is crucial for word segmentation, since it determines what the possible words will look like.
Exploiting the Translated Treebank | However, Chinese involves a special primary processing task, namely word segmentation.
Exploiting the Translated Treebank | Note that CTB or any other Chinese treebank has its own word segmentation guideline. |
Exploiting the Translated Treebank | Because the English treebank is translated into Chinese word by word, the Chinese words in the translated text are exactly entries from the bilingual lexicon; they are actually irregular phrases, short sentences, or other units rather than words that follow any existing word segmentation convention.