Abstract | Current computational models of unsupervised word segmentation usually assume idealized input that is devoid of phonological variation.
Abstract | We extend a nonparametric model of word segmentation by adding phonological rules that map from underlying forms to surface forms to produce a mathematically well-defined joint model as a first step towards handling variation and segmentation in a single model. |
Abstract | We analyse how our model handles /t/-deletion on a large corpus of transcribed speech, and show that the joint model can perform word segmentation and recover underlying /t/s. |
Background and related work | This permits us to develop a joint generative model for both word segmentation and variation which we plan to extend to handle more phenomena in future work. |
Background and related work | They do not aim for a joint model that also handles word segmentation, however, and rather than training their model on an actual corpus, they evaluate on constructed lists of examples, mimicking frequencies of real data.
Experiments 4.1 The data | This allows us to investigate the strength of the statistical signal for the deletion rule without confounding it with word segmentation performance, and to see how the different contextual settings (uniform, right, and left-right) handle the data.
Experiments 4.1 The data | Table 5: /t/-recovery F-scores when performing joint word segmentation in the left-right setting, averaged over two runs (standard errors less than 2%).
Experiments 4.1 The data | Finally, we are also interested to learn how well we can do word segmentation and underlying /t/-recovery jointly. |
Introduction | Computational models of word segmentation try to solve one of the first problems language learners have to face: breaking an unsegmented stream of sound segments into individual words. |
The computational model | Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation.
The computational model | Bayesian word segmentation models try to compactly represent the observed data in terms of a small set of units (word types) and a short analysis (a small number of word tokens). |
Abstract | Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. |
Abstract | With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features. |
Introduction | Problems related to information retrieval, machine translation and social computing need fast and accurate text processing, for example, word segmentation and parsing. |
Introduction | Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets.
Introduction | (b) Knowledge for word segmentation |
Character-based Chinese Parsing | To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model, which can jointly perform word segmentation, POS tagging and phrase-structure parsing.
Character-based Chinese Parsing | First, we split the original SHIFT action into SHIFT-SEPARATE(t) and SHIFT-APPEND, which jointly perform the word segmentation and POS tagging tasks.
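The effect of the two shift actions can be illustrated with a minimal sketch. This is not the paper's full transition system (no parse actions or features are shown); the characters, POS tags, and the helper `apply_actions` are invented for illustration:

```python
# Minimal sketch of how SHIFT-SEPARATE(t) and SHIFT-APPEND build tagged
# words from a character stream. SHIFT-SEPARATE(t) starts a new word with
# POS tag t; SHIFT-APPEND attaches the next character to the current word.

def apply_actions(chars, actions):
    """Replay shift actions over a character sequence, returning (word, tag) pairs."""
    words = []
    for i, act in enumerate(actions):
        if act[0] == "SHIFT-SEPARATE":   # start a new word with POS tag act[1]
            words.append([chars[i], act[1]])
        elif act[0] == "SHIFT-APPEND":   # append this character to the current word
            words[-1][0] += chars[i]
    return [(w, t) for w, t in words]

# "中国人民" -> 中国/NR 人民/NN (illustrative segmentation and tags)
actions = [("SHIFT-SEPARATE", "NR"), ("SHIFT-APPEND",),
           ("SHIFT-SEPARATE", "NN"), ("SHIFT-APPEND",)]
print(apply_actions("中国人民", actions))  # [('中国', 'NR'), ('人民', 'NN')]
```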
Character-based Chinese Parsing | The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010). |
Experiments | Since our model can jointly process word segmentation, POS tagging and phrase-structure parsing, we evaluate it on each of the three tasks.
Experiments | For word segmentation and POS tagging, standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging. |
Introduction | With regard to the task of parsing itself, an important advantage of the character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm.
Introduction | To analyze word structures in addition to phrase structures, our character-based parser naturally performs word segmentation, POS tagging and parsing jointly.
Introduction | We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information. |
Word Structures and Syntax Trees | They made use of this information to help joint word segmentation and POS tagging. |
Word Structures and Syntax Trees | For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use “b” and “i” to indicate the beginning and non-beginning characters of a word, respectively. |
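The "b"/"i" labeling of leaf characters described above can be sketched directly. The segmented example words and the helper name are invented for illustration:

```python
# Convert a segmented sentence into per-character "b"/"i" labels:
# "b" marks a word-beginning character, "i" any non-beginning character.

def to_bi_labels(words):
    """Label each character of each word with its position tag."""
    labels = []
    for word in words:
        labels.append("b")                  # first character of the word
        labels.extend("i" * (len(word) - 1))  # remaining characters
    return labels

print(to_bi_labels(["中国", "人民", "银行"]))
# ['b', 'i', 'b', 'i', 'b', 'i']
```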
Abstract | Since no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well on micro-blog texts as on ordinary news texts.
Abstract | In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog texts.
Experiment | We use the benchmark datasets provided by the Second International Chinese Word Segmentation Bakeoff as the labeled data.
Experiment | The first two are both well-known Chinese word segmentation tools: ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks involving word segmentation.
Experiment | The Stanford Chinese word segmenter is a CRF-based segmentation tool, and its segmentation standard is the PKU standard, which is the same as ours.
INTRODUCTION | These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. |
Our method | The Chinese word segmentation problem can be treated as a character labeling problem, in which each character receives a label indicating its position within a word.
Related Work | Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). |
Related Work | On the other hand, unsupervised word segmentation (Peng and Schuurmans (2001); Goldwater et al. (1998)) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
Abstract | This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging. |
Introduction | Word segmentation and part-of-speech (POS) tagging are two critical and necessary initial procedures with respect to the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation. |
Introduction | The joint approaches of word segmentation and POS tagging (joint S&T) are proposed to resolve these two tasks simultaneously. |
Introduction | As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging. |
Method | The performance measures for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
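The balanced F-score is the harmonic mean of precision and recall and can be computed directly; the word-level counts below are invented for illustration:

```python
# Balanced F-score F = 2PR/(P+R) from word-level counts:
# `correct`   = number of correctly segmented words,
# `predicted` = number of words the system output,
# `gold`      = number of words in the gold segmentation.

def f_score(correct, predicted, gold):
    """Return precision, recall, and balanced F-score."""
    p = correct / predicted          # precision
    r = correct / gold               # recall
    return p, r, 2 * p * r / (p + r)

p, r, f = f_score(correct=90, predicted=100, gold=120)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.75 0.818
```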
Method | This outcome verifies the commonly accepted fact that the joint model can substantially improve the pipeline one, since POS tags provide additional information to word segmentation (Ng and Low, 2004). |
Method | Overall, for word segmentation, it obtains average improvements of 1.43% and 8.09% in F-score and OOV-R over the others; for POS tagging, it achieves average improvements of 1.09% and 7.73%.
Abstract | We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation , we propose to model the two tasks jointly. |
Conclusion | There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR). |
Introduction | This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly. |
Methodology | Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task) and marks the component words as informal or formal (the Informal Word Recognition, IWR, task).
Methodology | Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003). |
Related Work | Closely related to our work is the task of Chinese new word detection, normally treated as a separate process from word segmentation in most previous works (Chen and Bai, 1998; Wu and Jiang, 2000; Chen and Ma, 2002; Gao et al., 2005). |
Abstract | This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. |
Experiment | Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets. |
Experiment | It is a supervised joint model of word segmentation, POS tagging and dependency parsing.
Introduction | Chinese word segmentation (CWS) is a critical and necessary initial procedure for the majority of high-level Chinese language processing tasks, such as syntax parsing, information extraction and machine translation, since Chinese script is written in continuous characters without explicit word boundaries.
Segmentation Models | Character-based models treat word segmentation as a sequence labeling problem, assigning labels to the characters in a sentence indicating their positions in a word. |
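The decoding direction of this character-labeling view can also be sketched: given a predicted label sequence, the word boundaries are recovered deterministically. The BMES-style tags and the helper name here are a common illustrative choice, not a specific model's output:

```python
# Recover a word segmentation from per-character position labels:
# B = word-beginning, M = word-middle, E = word-end, S = single-character word.

def labels_to_words(chars, labels):
    """Group characters into words according to their position labels."""
    words, current = [], ""
    for ch, lab in zip(chars, labels):
        if lab in ("B", "S"):    # a new word starts at this character
            if current:
                words.append(current)
            current = ch
        else:                    # M or E: continue the current word
            current += ch
    if current:
        words.append(current)
    return words

print(labels_to_words("中国人民银行", "BEBEBE"))  # ['中国', '人民', '银行']
```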
Abstract | We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. |
Concluding Remarks | A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010). |
Introduction | Hierarchical Bayesian methods have been mainstream in unsupervised word segmentation since the advent of the hierarchical Dirichlet process (Goldwater et al., 2009) and adaptor grammars (Johnson and Goldwater, 2009).
Abstract | Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use.
Introduction | Accurate word segmentation (WS) is a key component of successful language processing.
Introduction | For example, when splitting the compound noun burakkisshureddo, a traditional word segmenter can easily segment it as “*blacki shred”, since shureddo “shred” is a known, frequent word.
Experiments | The first category is caused by incorrect word segmentation (40.85%). |
Experiments | The result of word segmentation directly determines the performance of extraction, so it causes most of the errors.
Experiments | In the future, we can improve the performance of WikiCiKE by improving the word segmentation results.