Abstract | Character-level information can benefit downstream applications by offering flexible granularities for word segmentation while improving word-level dependency parsing accuracies. |
Character-Level Dependency Tree | We differentiate intra-word dependencies and inter-word dependencies by the arc type, so that our work can be compared with conventional word segmentation, POS-tagging and dependency parsing pipelines under a canonical segmentation standard.
Character-Level Dependency Tree | The character-level dependency trees hold to a specific word segmentation standard, but are not limited to it. |
Character-Level Dependency Tree | A transition-based framework with global learning and beam search decoding (Zhang and Clark, 2011) has been applied to a number of natural language processing tasks, including word segmentation, POS-tagging and syntactic parsing (Zhang and Clark, 2010; Huang and Sagae, 2010; Bohnet and Nivre, 2012; Zhang et al., 2013).
Introduction | Chinese dependency trees were conventionally defined over words (Chang et al., 2009; Li et al., 2012), requiring word segmentation and POS-tagging as preprocessing steps. |
Introduction | First, character-level trees circumvent the issue that no universal standard exists for Chinese word segmentation.
Introduction | In the well-known Chinese word segmentation bakeoff tasks, for example, different segmentation standards have been used by different data sets (Emerson, 2005). |
Abstract | Since no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well on micro-blog texts as on ordinary news texts.
Abstract | In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs.
Experiment | We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff as the labeled data.
Experiment | The first two are both well-known Chinese word segmentation tools, ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks involving word segmentation.
Experiment | The Stanford Chinese word segmenter is a CRF-based segmentation tool, and its segmentation standard is the PKU standard, the same as ours.
INTRODUCTION | These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs. |
Our method | The Chinese word segmentation problem can be treated as a character labeling problem, in which each character is given a label indicating its position within a word.
Related Work | Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003). |
Related Work | On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
Character-based Chinese Parsing | To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model, which can jointly perform word segmentation, POS tagging and phrase-structure parsing.
Character-based Chinese Parsing | First, we split the original SHIFT action into SHIFT-SEPARATE(t) and SHIFT-APPEND, which jointly perform the word segmentation and POS tagging tasks.
Character-based Chinese Parsing | The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010). |
Experiments | Since our model can jointly process word segmentation, POS tagging and phrase-structure parsing, we evaluate our model for the three tasks, respectively.
Experiments | For word segmentation and POS tagging, standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging. |
Introduction | With regard to the task of parsing itself, an important advantage of the character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm.
Introduction | To analyze word structures in addition to phrase structures, our character-based parser naturally performs word segmentation, POS tagging and parsing jointly.
Introduction | We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information. |
Word Structures and Syntax Trees | They made use of this information to help joint word segmentation and POS tagging. |
Word Structures and Syntax Trees | For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use “b” and “i” to indicate the beginning and non-beginning characters of a word, respectively. |
Abstract | Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing. |
Abstract | With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features. |
Introduction | Problems related to information retrieval, machine translation and social computing need fast and accurate text processing, for example, word segmentation and parsing. |
Introduction | Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets.
Introduction | (b) Knowledge for word segmentation |
Abstract | Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic “function words” at the beginnings and endings of collocations of (possibly multisyllabic) words. |
Abstract | This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score. |
Introduction | We do this by comparing two computational models of word segmentation which differ solely in the way that they model function words. |
Introduction | (1996) and Brent (1999) our word segmentation models identify word boundaries from unsegmented sequences of phonemes corresponding to utterances, effectively performing unsupervised learning of a lexicon. |
Introduction | a word segmentation model should segment this as ju want tu si ðə buk, which is the IPA representation of “you want to see the book”.
Word segmentation with Adaptor Grammars | Perhaps the simplest word segmentation model is the unigram model, where utterances are modeled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009). |
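As an illustration of the unigram idea, the score of a segmented utterance is simply the product of its word probabilities; this is a minimal sketch with a made-up toy lexicon, not values from any of the cited models.

```python
# Toy unigram model: the probability of a segmented utterance is the
# product of the probabilities of its words. All values are illustrative.
LEXICON = {"the": 0.5, "dog": 0.25, "thedog": 0.1}

def segmentation_prob(words, lexicon):
    """P(utterance, segmentation) under a unigram model: product of P(word)."""
    p = 1.0
    for w in words:
        p *= lexicon.get(w, 0.0)  # unseen words get no mass in this sketch
    return p

# The model favors whichever analysis has the larger product:
# segmentation_prob(["the", "dog"], LEXICON) -> 0.125
# segmentation_prob(["thedog"], LEXICON)     -> 0.1
```

Here the two-word analysis wins because its product of word probabilities is larger; real models additionally place a prior over the lexicon itself.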
Abstract | Current computational models of unsupervised word segmentation usually assume idealized input that is devoid of these kinds of variation. |
Abstract | We extend a nonparametric model of word segmentation by adding phonological rules that map from underlying forms to surface forms to produce a mathematically well-defined joint model as a first step towards handling variation and segmentation in a single model. |
Abstract | We analyse how our model handles /t/-deletion on a large corpus of transcribed speech, and show that the joint model can perform word segmentation and recover underlying /t/s. |
Background and related work | This permits us to develop a joint generative model for both word segmentation and variation which we plan to extend to handle more phenomena in future work. |
Background and related work | They do not aim for a joint model that also handles word segmentation , however, and rather than training their model on an actual corpus, they evaluate on constructed lists of examples, mimicking frequencies of real data. |
Experiments 4.1 The data | This allows us to investigate the strength of the statistical signal for the deletion rule without confounding it with the word segmentation performance, and to see how the different contextual settings (uniform, right and left-right) handle the data.
Experiments 4.1 The data | Table 5: /t/-recovery F-scores when performing joint word segmentation in the left-right setting, averaged over two runs (standard errors less than 2%).
Experiments 4.1 The data | Finally, we are also interested to learn how well we can do word segmentation and underlying /t/-recovery jointly. |
Introduction | Computational models of word segmentation try to solve one of the first problems language learners have to face: breaking an unsegmented stream of sound segments into individual words. |
The computational model | Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation.
The computational model | Bayesian word segmentation models try to compactly represent the observed data in terms of a small set of units (word types) and a short analysis (a small number of word tokens). |
Abstract | We show for both English POS tagging and Chinese word segmentation that, with proper representation, a large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference.
Abstract | In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation.
Abstract | For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging. |
Abstract | We present a joint model for Chinese word segmentation and new word detection. |
Abstract | As we know, training a word segmentation system on large-scale datasets is already costly. |
Introduction | The major problem of Chinese word segmentation is the ambiguity. |
Introduction | In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD). |
Introduction | As we know, training a word segmentation system on large-scale datasets is already costly. |
Related Work | First, we review related work on word segmentation and new word detection. |
Related Work | 2.1 Word Segmentation and New Word Detection |
Related Work | Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010). |
Abstract | In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN). |
Abstract | Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks. |
Conventional Neural Network | Formally, in the Chinese word segmentation task, we have a character dictionary D of size |D|. Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere.
Conventional Neural Network | In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses 4 tags to carry word boundary information.
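As a sketch of how the BMES scheme assigns labels to the characters of a segmented sentence (the helper function and example strings are our illustration, not from the paper):

```python
def bmes_tags(words):
    """Map a segmented sentence (a list of words) to per-character tags:
    B = begin, M = middle, E = end of a multi-character word, S = single."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# bmes_tags(["AB", "C"]) -> ["B", "E", "S"]
```

Predicting one of these four tags per character turns segmentation into a standard sequence labeling problem.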
Conventional Neural Network | (2013) modeled Chinese word segmentation as a series of |
Introduction | Therefore, word segmentation is a preliminary and important pre-process for Chinese language processing. |
Introduction | (2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance. |
Introduction | We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005), which are commonly used for evaluation of Chinese word segmentation.
Abstract | The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters.
Abstract | We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging. |
Chinese Morphological Analysis with Character-level POS | Previous studies have shown that jointly processing word segmentation and POS tagging is preferable to pipeline processing, which can propagate errors (Nakagawa and Uchimoto, 2007; Kruengkrai et al., 2009).
Conclusion | A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-speech Tagging. |
Conclusion | Word Lattice Reranking for Chinese Word Segmentation |
Evaluation | To evaluate our proposed method, we have conducted two sets of experiments on CTB5: word segmentation, and joint word segmentation and word-level POS tagging. |
Evaluation | The results of the word segmentation experiment and the joint experiment of segmentation and POS tagging are shown in Table 5(a) and Table 5(b), respectively. |
Evaluation | The results show that, while the differences between the baseline model and the proposed model in word segmentation accuracies are small, the proposed model achieves significant improvement in the experiment of joint segmentation and POS tagging.
Introduction | In recent years, the focus of research on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words toward characters.
Introduction | We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging. |
Abstract | We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese.
Introduction | spaces) between words, word segmentation is the crucial first step that is necessary to perform virtually all NLP tasks.
Introduction | Because the tasks of word segmentation and POS tagging have strong interactions, many studies have been devoted to the task of joint word segmentation and POS tagging for languages such as Chinese (e.g. |
Introduction | The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than |
Model | (2011), we build our joint model to solve word segmentation , POS tagging, and dependency parsing within a single framework. |
Model | In our joint model, the early update is invoked by mistakes in any of word segmentation, POS tagging, or dependency parsing.
Related Works | In addition, the lattice does not include word segmentation ambiguities crossing boundaries of space-delimited tokens. |
Related Works | However, because they regarded word segmentation as given, their model did not consider the |
Abstract | In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. |
Abstract | We confirmed that it significantly outperforms previously reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation.
Inference | To find the hidden word segmentation w of a string s = c1 ... cN, which is equivalent to the vector of binary hidden variables z = z1 ... zN, the simplest approach is to build a Gibbs sampler that randomly selects a character ci and draws a binary decision zi as to whether there is a word boundary, and then updates the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).
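The pointwise boundary-sampling move described here can be sketched as follows; the toy unigram lexicon and base probability below are illustrative assumptions of ours, not the paper's nested Pitman-Yor model.

```python
import random

# Toy unigram model standing in for the real language model; the values
# and the fallback mass BASE are illustrative assumptions.
LEXICON = {"ab": 0.4, "c": 0.3, "abc": 0.2}
BASE = 1e-4  # fallback probability for substrings outside the toy lexicon

def to_words(chars, z):
    """Cut the string after character i wherever z[i-1] == 1; z has one
    binary boundary variable per gap between adjacent characters."""
    out, start = [], 0
    for i, cut in enumerate(z, start=1):
        if cut:
            out.append(chars[start:i])
            start = i
    out.append(chars[start:])
    return out

def seg_prob(chars, z):
    """Unnormalized probability of a segmentation under the toy unigram model."""
    p = 1.0
    for w in to_words(chars, z):
        p *= LEXICON.get(w, BASE)
    return p

def gibbs_step(chars, z, i, rng):
    """Resample the single boundary variable z[i] given all the others."""
    z0, z1 = list(z), list(z)
    z0[i], z1[i] = 0, 1
    p0, p1 = seg_prob(chars, z0), seg_prob(chars, z1)
    z[i] = 1 if rng.random() < p1 / (p0 + p1) else 0
    return z
```

Sweeping `gibbs_step` over all positions repeatedly yields the sampler criticized in the paper; the blocked sampler it proposes instead resamples whole strings at once with dynamic programming.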
Introduction | Asian languages such as Chinese and Japanese have no explicit word boundaries, thus word segmentation is a crucial first step when processing them. |
Introduction | In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations. |
Introduction | This maximizes the probability of word segmentation w given a string s:
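The display equation following this sentence did not survive extraction; a minimal reconstruction of the stated objective (the paper's exact form may carry additional conditioning) is:

```latex
\hat{w} = \operatorname*{arg\,max}_{w} \; p(w \mid s)
```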
Nested Pitman-Yor Language Model | If a lexicon is finite, we can use a uniform prior G0(w) = 1/|V| for every word w in lexicon V. However, with word segmentation every substring could be a word, thus the lexicon is not limited but will be countably infinite.
Nested Pitman-Yor Language Model | Building an accurate G0 is crucial for word segmentation, since it determines what the possible words will look like.
Abstract | In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging. |
Background | In joint word segmentation and the POS tagging process, the task is to predict a path |
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments. |
Experiments | We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag). |
Experiments | (2008a; 2008b) on CTB 5.0 and Zhang and Clark (2008) on CTB 4.0 since they reported the best performances on joint word segmentation and POS tagging using the training materials only derived from the corpora. |
Introduction | In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks. |
Introduction | Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation. |
Introduction | Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over a pipelined fashion (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b). |
Policies for correct path selection | In our experiments, the optimal threshold value is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
Related work | Maximum entropy models are widely used for word segmentation and POS tagging tasks (Uchimoto et al., 2001; Ng and Low, 2004; Nakagawa, 2004; Nakagawa and Uchimoto, 2007) since they only need moderate training times while they provide reasonable performance. |
Abstract | We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. |
Automatic Annotation Adaptation | Considering that word segmentation and Joint S&T can be conducted in the same character classification manner, we can design a unified standard adaptation framework for the two tasks, by taking the source classifier’s classification result as the guide information for the target classifier’s classification decision.
Introduction | Figure 1: Incompatible word segmentation and POS tagging standards between CTB (upper) and People’s Daily (below).
Introduction | To test the efficacy of our method we choose Chinese word segmentation and part-of-speech tagging, where the problem of incompatible annotation standards is one of the most evident: so far no segmentation standard is widely accepted due to the lack of a clear definition of Chinese words, and the (almost complete) lack of morphology results in much bigger ambiguities and heavy debates in tagging philosophies for Chinese parts-of-speech. |
Introduction | In addition, the improved accuracies from segmentation and tagging also lead to an improved parsing accuracy on CTB, reducing 38% of the error propagation from word segmentation to parsing. |
Segmentation and Tagging as Character Classification | Given a character sequence C1 C2 ... Cn, where Ci is a character, word segmentation aims to split the sequence into m (≤ n) words: C1:e1 Ce1+1:e2 ... Cem-1+1:em
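Under this notation, the word end indices e1, ..., em fully determine the words as contiguous character spans; a small illustration (the helper name is ours):

```python
def words_from_ends(chars, ends):
    """Split the character sequence C1..Cn into words C1:e1, Ce1+1:e2, ...,
    given the 1-based end index of each word (the last index must equal n)."""
    out, start = [], 0
    for e in ends:
        out.append(chars[start:e])
        start = e
    return out

# words_from_ends("ABCD", [2, 3, 4]) -> ["AB", "C", "D"]
```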
Segmentation and Tagging as Character Classification | Xue and Shen (2003) describe for the first time the character classification approach for Chinese word segmentation, where each character is given a boundary tag denoting its relative position in a word.
Segmentation and Tagging as Character Classification | In addition, Ng and Low (2004) find that, compared with POS tagging after word segmentation , Joint S&T can achieve higher accuracy on both segmentation and POS tagging. |
Introduction | We show that simultaneously learning syllable structure and collocations improves word segmentation accuracy compared to models that learn these independently. |
Introduction | This paper applies adaptor grammars to word segmentation and morphological acquisition. |
Word segmentation with adaptor grammars | We now turn to linguistic applications of adaptor grammars, specifically, to models of unsupervised word segmentation.
Word segmentation with adaptor grammars | Table 1: Word segmentation f-score results for all models, as a function of DP concentration parameter α.
Word segmentation with adaptor grammars | Table 1 summarizes the word segmentation f-scores for all models described in this paper. |
About Heterogeneous Annotations | For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm. |
About Heterogeneous Annotations | Take Chinese word segmentation for example. |
Abstract | We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. |
Conclusion | Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments. |
Introduction | This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks. |
Introduction | In particular, joint word segmentation and POS tagging is addressed as a two step process. |
Joint Chinese Word Segmentation and POS Tagging | words, word segmentation and POS tagging are important initial steps for Chinese language processing. |
Joint Chinese Word Segmentation and POS Tagging | Two kinds of approaches are popular for joint word segmentation and POS tagging. |
Abstract | This study investigates building a better Chinese word segmentation model for statistical machine translation.
Experiments | The influence of word segmentation on the final translation is our main investigation.
Experiments | Firstly, as expected, having word segmentation does help Chinese-to-English MT. |
Experiments | This section aims to further analyze the three primary observations concluded in Section 4.3: i) word segmentation is useful to SMT; ii) the treebank and the bilingual segmentation knowledge are helpful, performing segmentation of different natures; and iii) the bilingual constraints lead to learning segmentations better tailored for SMT.
Introduction | Word segmentation is regarded as a critical procedure for high-level Chinese language processing tasks, since Chinese scripts are written in continuous characters without explicit word boundaries (e.g., space in English). |
Introduction | The empirical works show that word segmentation can be beneficial to Chinese-to-English statistical machine translation (SMT) (Xu et al., 2005; Chang et al., 2008; Zhao et al., 2013). |
Introduction | The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on the hand-annotated treebank data, e.g., Chinese treebank |
Abstract | This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging. |
Introduction | Word segmentation and part-of-speech (POS) tagging are two critical and necessary initial procedures with respect to the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation. |
Introduction | The joint approaches of word segmentation and POS tagging (joint S&T) are proposed to resolve these two tasks simultaneously. |
Introduction | As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging. |
Method | The performance measurement indicators for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
Method | This outcome verifies the commonly accepted fact that the joint model can substantially improve the pipeline one, since POS tags provide additional information to word segmentation (Ng and Low, 2004). |
Method | Overall, for word segmentation, it obtains average improvements of 1.43% and 8.09% in F-score and OOV-R over others; for POS tagging, it achieves average improvements of 1.09% and 7.73%.
Conclusion | TESLA-CELAB does not have a segmentation step, hence it will not introduce word segmentation errors. |
Discussion and Future Work | Chinese word segmentation.
Experiments | We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonym |
Experiments | Note also that the word segmentations shown in these examples are for clarity only. |
Introduction | The most obvious challenge for Chinese is that of word segmentation . |
Introduction | However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging.
Introduction | The only prior work attempting to address the problem of word segmentation in automatic MT evaluation for Chinese that we are aware of is Li et |
Motivation | Character-based metrics do not suffer from errors and differences in word segmentation, so the two alternative segmentations of the same character sequence would be judged exactly equal.
Abstract | Automatic extraction of new words is an indispensable precursor to many NLP tasks such as Chinese word segmentation , named entity extraction, and sentiment analysis. |
Experiment | The posts were then part-of-speech tagged using a Chinese word segmentation tool named ICTCLAS (Zhang et al., 2003). |
Introduction | Automatic extraction of new words is indispensable to many tasks such as Chinese word segmentation , machine translation, named entity extraction, question answering, and sentiment analysis. |
Introduction | New word detection is one of the most critical issues in Chinese word segmentation . |
Introduction | Recent studies (Sproat and Emerson, 2003; Chen, 2003) have shown that more than 60% of word segmentation errors result from new words.
Methodology | Obviously, in order to obtain the value of s(wi), some particular Chinese word segmentation tool is required.
Related Work | New word detection has been usually interweaved with word segmentation , particularly in Chinese NLP. |
Related Work | In these works, new word detection is considered as an integral part of segmentation, where new words are identified as the most probable segments inferred by the probabilistic models; and the detected new word can be further used to improve word segmentation . |
Abstract | Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. |
Complexity Analysis | The proposed method does not require any annotated data, yet the SMT system using it can achieve performance comparable to state-of-the-art supervised word segmenters trained on precious annotated data.
Complexity Analysis | Moreover, the proposed method yields 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus. |
Complexity Analysis | Thus, we believe that the proposed method would benefit SMT related to low-resource languages where annotated data are scarce, and would also find application in domains that differ too greatly from the domains on which supervised word segmenters were trained.
Introduction | Many languages, especially Asian languages such as Chinese, Japanese and Myanmar, have no explicit word boundaries, thus word segmentation (WS), that is, segmenting the continuous texts of these languages into isolated words, is a prerequisite for many natural language processing applications including SMT. |
Introduction | improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
Experiments | Maximum matching word segmentation is used with a large word vocabulary V extracted from web data provided by Wang et al. (2013b).
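Forward maximum matching can be sketched as follows; the vocabulary and window size here are illustrative assumptions, not the actual web-derived vocabulary.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    vocabulary entry that matches, falling back to a single character."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in vocab:
                out.append(text[i:i + l])
                i += l
                break
    return out

# max_match("abcd", {"ab", "abc", "d"}) -> ["abc", "d"]
```

The greedy longest-first choice is what makes the method fast but vulnerable to overlap ambiguities, which is why it is typically paired with a very large vocabulary.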
Pinyin Input Method Model | Without word delimiters, linguists have argued on what a Chinese word really is for a long time and that is why there is always a primary word segmentation treatment in most Chinese language processing tasks (Zhao et al., 2006; Huang and Zhao, 2007; Zhao and Kit, 2008; Zhao et al., 2010; Zhao and Kit, 2011; Zhao et al., 2013). |
Pinyin Input Method Model | A Chinese word may contain from 1 to over 10 characters due to different word segmentation conventions. |
Pinyin Input Method Model | Nevertheless, pinyin syllable segmentation is a much easier problem compared to Chinese word segmentation.
Abstract | We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation , we propose to model the two tasks jointly. |
Conclusion | There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR). |
Introduction | This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly. |
Methodology | Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
Methodology | Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003). |
Related Work | Closely related to our work is the task of Chinese new word detection, normally treated as a separate process from word segmentation in most previous works (Chen and Bai, 1998; Wu and Jiang, 2000; Chen and Ma, 2002; Gao et al., 2005). |
Abstract | The Omni-word feature uses every potential word in a sentence as a lexicon feature, reducing errors caused by word segmentation.
Abstract | Both Omni-word feature and soft constraint make a better use of sentence information and minimize the influences caused by Chinese word segmentation and parsing. |
Feature Construction | On the other hand, the Omni-word can avoid these problems and take advantages of Chinese characteristics (the word-formation and the ambiguity of word segmentation ). |
Introduction | The lack of orthographic word boundaries makes Chinese word segmentation difficult. |
Related Work | (2008; 2010) also pointed out that, due to the inaccuracy of Chinese word segmentation and parsing, the tree kernel based approach is inappropriate for Chinese relation extraction. |
A single character is used if no suffix occurs 10 times. | In a full understanding system, output of the word segmenter would be passed to morphological and local syntactic processing. |
A single character is used if no suffix occurs 10 times. | Because standard models of morphological learning do not address the interaction with word segmentation, WordEnds does a simple version of this repair process using a placeholder algorithm called Mini-morph. |
Previous work | Word segmentation experiments by Christiansen and Allen (1997) and Harrington et al. |
The task in more detail | The datasets are informal conversations in which debatable word segmentations are rare. |
The task in more detail | A theory of word segmentation must explain how affixes differ from freestanding function words. |
Abstract | This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models. |
Experiment | Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets. |
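The F-scores reported for segmentation are conventionally word-level: a predicted word counts as correct only if both its boundaries match the gold standard. A small sketch of that metric (function names are illustrative, not from any cited paper):

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def seg_f1(gold, pred):
    """Word-level F1 between two segmentations of the same character string."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)                      # words whose both boundaries match
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

gold = ["中国", "人民", "银行"]
pred = ["中国", "人", "民", "银行"]
print(round(seg_f1(gold, pred), 3))  # 0.571
```

Here only 中国 and 银行 match (precision 2/4, recall 2/3), which is why a single over-split word costs two predicted tokens.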
Experiment | It is a supervised joint model of word segmentation, POS tagging and dependency parsing. |
Introduction | Chinese word segmentation (CWS) is a critical and necessary initial procedure for most high-level Chinese language processing tasks, such as syntactic parsing, information extraction and machine translation, since Chinese script is written as continuous characters without explicit word boundaries. |
Segmentation Models | Character-based models treat word segmentation as a sequence labeling problem, assigning labels to the characters in a sentence indicating their positions in a word. |
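Under this sequence-labeling view, a common choice of position labels is the BMES scheme (Begin, Middle, End, Single); the sketch below shows the conversion in both directions. The scheme is an assumption here since the line above does not fix a tag set, and BIES or two-tag variants are also used.

```python
def words_to_tags(words):
    """Convert a segmented sentence to per-character BMES position labels."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from a character string and its BMES labels."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("S", "E"):        # a word ends at S or E
            words.append(buf)
            buf = ""
    if buf:                        # tolerate a dangling B/M at sentence end
        words.append(buf)
    return words

tags = words_to_tags(["我", "喜欢", "自然语言"])
print(tags)                              # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(tags_to_words("我喜欢自然语言", tags))  # ['我', '喜欢', '自然语言']
```

A tagger (CRF, structured perceptron, neural) then predicts the label sequence, and segmentation falls out of the decoding.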
Experiments | This version, which contains 9790 utterances (33399 tokens, 1321 types), is now standard for word segmentation, but contains no phonetic variability. |
Experiments | As a simple extension of our model to the case of unknown word boundaries, we interleave it with an existing model of word segmentation , olpseg (Gold- |
Introduction | For example, many models of word segmentation implicitly or explicitly build a lexicon while segmenting the input stream of phonemes into word tokens; in nearly all cases the phonemic input is created from an orthographic transcription using a phonemic dictionary, thus abstracting away from any phonetic variability (Brent, 1999; Venkataraman, 2001; Swingley, 2005; Goldwater et al., 2009, among others). |
Related work | A final line of related work is on word segmentation. |
Related work | In addition to the models mentioned in Section 1, which use phonemic input, a few models of word segmentation have been tested using phonetic input (Fleck, 2008; Rytting, 2007; Daland and Pierrehumbert, 2010). |
Experiments and Analysis | CDT and CTB5/6 adopt different POS tag sets, and converting from one tag set to another is difficult (Niu et al., 2009). To overcome this problem, we use the People's Daily corpus (PD), a large-scale corpus annotated with word segmentation and POS tags, to train a statistical POS tagger. |
Experiments and Analysis | The word segmentation standards of the two treebanks also differ slightly, which is not considered in this work. |
Experiments and Analysis | Moreover, inferior results may be obtained due to the differences between CTB5 and PD in word segmentation standards and text sources. |
Related Work | (2009) improve the performance of word segmentation and part-of-speech (POS) tagging on CTB5 using another large-scale corpus with different annotation standards (People's Daily). |
Exploiting the Translated Treebank | However, Chinese has a special primary processing task, i.e., word segmentation . |
Exploiting the Translated Treebank | Note that CTB or any other Chinese treebank has its own word segmentation guideline. |
Exploiting the Translated Treebank | As the English treebank is translated into Chinese word by word, the Chinese words in the translated text are exactly entries from the bilingual lexicon; they are actually irregular phrases, short sentences or other units, rather than words that follow any existing word segmentation convention. |
Experiments | The first category is caused by incorrect word segmentation (40.85%). |
Experiments | The result of word segmentation directly decides the performance of extraction, so it causes most of the errors. |
Experiments | In the future, we can improve the performance of WikiCiKE by polishing the word segmentation result. |
Abstract | Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. |
Introduction | Accurate word segmentation (WS) is a key component of successful language processing. |
Introduction | For example, when splitting the compound noun ブラキッシュレッド burakisshureddo, a traditional word segmenter can easily segment this as ブラキッ/シュレッド "*blacki shred" since シュレッド shureddo "shred" is a known, frequent word. |
Abstract | We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. |
Concluding Remarks | A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010). |
Introduction | Hierarchical Bayesian methods have been mainstream in unsupervised word segmentation since the advent of the hierarchical Dirichlet process (Goldwater et al., 2009) and adaptor grammars (Johnson and Goldwater, 2009). |
Capturing Paradigmatic Relations via Word Clustering | 3.1.3 Preprocessing: Word Segmentation |
Capturing Paradigmatic Relations via Word Clustering | In this table, the symbol "+" in the Features column means the current configuration contains both the baseline features and the new cluster-based features; the number is the total number of clusters; the symbol "+" in the Data column indicates which portion of the Gigaword data is used to cluster words; the symbols "S" and "SS" in parentheses denote (s)upervised and (s)emi-(s)upervised word segmentation. |
State-of-the-Art | Penn Chinese Treebank (CTB) (Xue et al., 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and |
Previous Work | Nakagawa (2004) combines word-level and character-level information for Chinese and Japanese word segmentation. |
Previous Work | (of all words in a given sentence) and the POS tagging (of the known words) is based on a Viterbi search over a lattice composed of all possible word segmentations and the possible classifications of all observed characters. |
Previous Work | Their experimental results show that the method achieves higher accuracy than state-of-the-art methods for Chinese and Japanese word segmentation. |
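The lattice-plus-Viterbi idea above can be sketched in miniature: build a lattice of all lexicon matches over the character string and pick the highest-scoring path. This is a hedged simplification of such models; the toy lexicon and its log-probability scores are invented for illustration, and real systems also score unknown-character nodes and POS tags on each lattice edge.

```python
import math

# Toy lexicon with made-up per-word log-probabilities (illustration only).
LEXICON = {"研究": -2.0, "研究生": -4.0, "生命": -3.0, "命": -5.0,
           "的": -1.0, "起源": -3.0, "生": -6.0}

def viterbi_segment(text):
    """Best-path word segmentation over the lattice of lexicon matches."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best score of a path covering text[:i]
    back = [None] * (n + 1)        # start index of the word ending at i
    best[0] = 0.0
    for i in range(n):
        if best[i] == -math.inf:
            continue
        for j in range(i + 1, n + 1):
            w = text[i:j]
            if w in LEXICON and best[i] + LEXICON[w] > best[j]:
                best[j] = best[i] + LEXICON[w]
                back[j] = i
    # Recover the best path by walking the backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

print(viterbi_segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

Note how the globally best path prefers 研究/生命 over the locally plausible prefix 研究生, which a greedy left-to-right matcher would have chosen.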