Index of papers in Proc. ACL 2009 that mention
  • word segmentation
Jiang, Wenbin and Huang, Liang and Liu, Qun
We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese.
Automatic Annotation Adaptation
Considering that word segmentation and Joint S&T can be conducted in the same character classification manner, we can design a unified standard adaptation framework for the two tasks, by taking the source classifier’s classification result as the guide information for the target classifier’s classification decision.
Figure 1: Incompatible word segmentation and POS tagging standards between CTB (upper) and People’s Daily (below).
To test the efficacy of our method we choose Chinese word segmentation and part-of-speech tagging, where the problem of incompatible annotation standards is one of the most evident: so far no segmentation standard is widely accepted due to the lack of a clear definition of Chinese words, and the (almost complete) lack of morphology results in much bigger ambiguities and heavy debates in tagging philosophies for Chinese parts-of-speech.
In addition, the improved accuracies from segmentation and tagging also lead to an improved parsing accuracy on CTB, reducing 38% of the error propagation from word segmentation to parsing.
Segmentation and Tagging as Character Classification
$C_1 C_2 \ldots C_n$ where $C_i$ is a character, word segmentation aims to split the sequence into $m$ ($\le n$) words: $C_{1:e_1}\; C_{e_1+1:e_2} \ldots C_{e_{m-1}+1:e_m}$
Segmentation and Tagging as Character Classification
Xue and Shen (2003) describe for the first time the character classification approach for Chinese word segmentation, where each character is given a boundary tag denoting its relative position in a word.
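The character classification view can be illustrated with a small sketch that converts a segmented sentence into per-character boundary tags and back. The four-tag B/M/E/S scheme (begin, middle, end, single) and the function names `words_to_tags` and `tags_to_words` are assumptions made for illustration; the excerpt does not say which tag set the cited work uses.

```python
def words_to_tags(words):
    """Map a list of words to one boundary tag per character (assumed B/M/E/S scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their boundary tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # these tags close a word
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words
```

With this encoding, segmentation reduces to predicting one tag per character, and the predicted tag sequence determines the word boundaries.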
Segmentation and Tagging as Character Classification
In addition, Ng and Low (2004) find that, compared with POS tagging after word segmentation, Joint S&T can achieve higher accuracy on both segmentation and POS tagging.
word segmentation is mentioned in 27 sentences in this paper.
Kruengkrai, Canasai and Uchimoto, Kiyotaka and Kazama, Jun'ichi and Wang, Yiou and Torisawa, Kentaro and Isahara, Hitoshi
In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
In joint word segmentation and the POS tagging process, the task is to predict a path
Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag).
(2008a; 2008b) on CTB 5.0 and Zhang and Clark (2008) on CTB 4.0 since they reported the best performances on joint word segmentation and POS tagging using the training materials only derived from the corpora.
In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks.
Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation.
Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over a pipelined fashion (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
Policies for correct path selection
Footnote 4: In our experiments, the optimal threshold value $r$ is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
Related work
Maximum entropy models are widely used for word segmentation and POS tagging tasks (Uchimoto et al., 2001; Ng and Low, 2004; Nakagawa, 2004; Nakagawa and Uchimoto, 2007) since they only need moderate training times while they provide reasonable performance.
word segmentation is mentioned in 14 sentences in this paper.
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.
We confirmed that it significantly outperforms previous reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation.
To find the hidden word segmentation $w$ of a string $s = c_1 \cdots c_N$, which is equivalent to the vector of binary hidden variables $z = z_1 \cdots z_N$, the simplest approach is to build a Gibbs sampler that randomly selects a character $c_i$ and draws a binary decision $z_i$ as to whether there is a word boundary, and then updates the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).
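A minimal sketch of such a boundary-wise Gibbs sampler follows. It is heavily simplified and is not the model or the blocked sampler proposed in the paper: it assumes a unigram word model with a Chinese-restaurant-process style predictive probability, ignores sentence-boundary terms, and uses a hypothetical base measure `base_prob` (uniform characters, geometric word length). The names `segment`, `gibbs_sweep`, `alpha`, and `p_stop` are likewise illustrative assumptions.

```python
import random
from collections import Counter


def base_prob(word, n_chars, p_stop=0.5):
    """Assumed base measure: geometric word length times uniform characters."""
    length_prob = p_stop * (1.0 - p_stop) ** (len(word) - 1)
    return length_prob * (1.0 / n_chars) ** len(word)


def segment(s, z):
    """Words implied by boundary vector z, where z[i] == 1 means a boundary after s[i]."""
    words, start = [], 0
    for i, boundary in enumerate(z):
        if boundary:
            words.append(s[start:i + 1])
            start = i + 1
    return words


def gibbs_sweep(s, z, counts, alpha, n_chars, rng):
    """Resample every internal boundary once, in random order, updating counts in place."""
    for i in rng.sample(range(len(s) - 1), len(s) - 1):
        # Locate the word(s) touching the candidate boundary after s[i].
        start = i
        while start > 0 and z[start - 1] == 0:
            start -= 1
        end = i + 1
        while z[end] == 0:  # z[-1] is always 1, so this terminates
            end += 1
        w1, w2, w12 = s[start:i + 1], s[i + 1:end + 1], s[start:end + 1]

        # Remove the current analysis of this span from the word counts.
        if z[i]:
            counts[w1] -= 1
            counts[w2] -= 1
        else:
            counts[w12] -= 1
        total = sum(counts.values())

        # Predictive probability of one joined word versus two separate words.
        p_join = (counts[w12] + alpha * base_prob(w12, n_chars)) / (total + alpha)
        p_first = (counts[w1] + alpha * base_prob(w1, n_chars)) / (total + alpha)
        p_second = (counts[w2] + (w1 == w2) + alpha * base_prob(w2, n_chars)) / (total + 1 + alpha)
        p_split = p_first * p_second

        # Draw the binary decision and add the chosen analysis back.
        z[i] = 1 if rng.random() < p_split / (p_split + p_join) else 0
        if z[i]:
            counts[w1] += 1
            counts[w2] += 1
        else:
            counts[w12] += 1


# Usage on a toy string: start with every character as its own word.
s = "thedogthecatthedog"
z = [1] * len(s)
counts = Counter(segment(s, z))
rng = random.Random(0)
for _ in range(50):
    gibbs_sweep(s, z, counts, alpha=1.0, n_chars=len(set(s)), rng=rng)
print(segment(s, z))
```

Each update touches only one boundary, so the word counts (the "language model" in this sketch) stay consistent with the current segmentation after every step.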
Asian languages such as Chinese and Japanese have no explicit word boundaries, thus word segmentation is a crucial first step when processing them.
In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations.
This maximizes the probability of word segmentation $w$ given a string $s$:
Nested Pitman-Yor Language Model
If a lexicon is finite, we can use a uniform prior $G_0(w) = 1/|V|$ for every word $w$ in lexicon $V$. However, with word segmentation every substring could be a word, thus the lexicon is not limited but will be countably infinite.
Nested Pitman-Yor Language Model
Building an accurate $G_0$ is crucial for word segmentation, since it determines what the possible words will look like.
word segmentation is mentioned in 25 sentences in this paper.
Zhao, Hai and Song, Yan and Kit, Chunyu and Zhou, Guodong
Exploiting the Translated Treebank
However, Chinese has a special primary processing task, i.e., word segmentation.
Exploiting the Translated Treebank
Note that CTB or any other Chinese treebank has its own word segmentation guideline.
Exploiting the Translated Treebank
English treebank is translated into Chinese word by word; Chinese words in the translated text are exactly some entries from the bilingual lexicon, and they are actually irregular phrases, short sentences, or something else rather than words that follow any existing word segmentation convention.
word segmentation is mentioned in 4 sentences in this paper.