Abstract | We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese. |
Introduction | spaces) between words, word segmentation is the crucial first step that is necessary to perform virtually all NLP tasks. |
Introduction | Because the tasks of word segmentation and POS tagging have strong interactions, many studies have been devoted to the task of joint word segmentation and POS tagging for languages such as Chinese (e.g. |
Introduction | The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than |
Model | (2011), we build our joint model to solve word segmentation, POS tagging, and dependency parsing within a single framework. |
Model | In our joint model, the early update is invoked by mistakes in any of word segmentation, POS tagging, or dependency parsing. |
Related Works | In addition, the lattice does not include word segmentation ambiguities crossing boundaries of space-delimited tokens. |
Related Works | However, because they regarded word segmentation as given, their model did not consider the |
Abstract | We present a joint model for Chinese word segmentation and new word detection. |
Abstract | As we know, training a word segmentation system on large-scale datasets is already costly. |
Introduction | The major problem of Chinese word segmentation is the ambiguity. |
Introduction | In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD). |
Introduction | As we know, training a word segmentation system on large-scale datasets is already costly. |
Related Work | First, we review related work on word segmentation and new word detection. |
Related Work | 2.1 Word Segmentation and New Word Detection |
Related Work | Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010). |
Abstract | We show for both English POS tagging and Chinese word segmentation that, with proper representation, a large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference. |
Abstract | In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation. |
Abstract | For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging. |
About Heterogeneous Annotations | For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm. |
About Heterogeneous Annotations | Take Chinese word segmentation for example. |
Abstract | We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. |
Conclusion | Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging. |
Experiments | Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments. |
Introduction | This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks. |
Introduction | In particular, joint word segmentation and POS tagging is addressed as a two-step process. |
Joint Chinese Word Segmentation and POS Tagging | words, word segmentation and POS tagging are important initial steps for Chinese language processing. |
Joint Chinese Word Segmentation and POS Tagging | Two kinds of approaches are popular for joint word segmentation and POS tagging. |
Conclusion | TESLA-CELAB does not have a segmentation step, hence it will not introduce word segmentation errors. |
Discussion and Future Work | Chinese word segmentation. |
Experiments | We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonym |
Experiments | Note also that the word segmentations shown in these examples are for clarity only. |
Introduction | The most obvious challenge for Chinese is that of word segmentation. |
Introduction | However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging. |
Introduction | The only prior work attempting to address the problem of word segmentation in automatic MT evaluation for Chinese that we are aware of is Li et |
Motivation | Character-based metrics do not suffer from errors and differences in word segmentation, so and ¥_l—?fi_5lk would be judged exactly equal. |
Experiments | This version, which contains 9790 utterances (33399 tokens, 1321 types), is now standard for word segmentation, but contains no phonetic variability. |
Experiments | As a simple extension of our model to the case of unknown word boundaries, we interleave it with an existing model of word segmentation, dpseg (Gold- |
Introduction | For example, many models of word segmentation implicitly or explicitly build a lexicon while segmenting the input stream of phonemes into word tokens; in nearly all cases the phonemic input is created from an orthographic transcription using a phonemic dictionary, thus abstracting away from any phonetic variability (Brent, 1999; Venkataraman, 2001; Swingley, 2005; Goldwater et al., 2009, among others). |
Related work | A final line of related work is on word segmentation. |
Related work | In addition to the models mentioned in Section 1, which use phonemic input, a few models of word segmentation have been tested using phonetic input (Fleck, 2008; Rytting, 2007; Daland and Pierrehumbert, 2010). |
Experiments and Analysis | CDT and CTB5/6 adopt different POS tag sets, and converting from one tag set to another is difficult (Niu et al., 2009).5 To overcome this problem, we use the People’s Daily corpus (PD),6 a large-scale corpus annotated with word segmentation and POS tags, to train a statistical POS tagger. |
Experiments and Analysis | 5 The word segmentation standards of the two treebanks also differ slightly, which is not considered in this work. |
Experiments and Analysis | Moreover, inferior results may be gained due to the differences between CTB5 and PD in word segmentation standards and text sources. |
Related Work | (2009) improve the performance of word segmentation and part-of-speech (POS) tagging on CTB5 using another large-scale corpus of different annotation standards (People’s Daily). |
Capturing Paradigmatic Relations via Word Clustering | 3.1.3 Preprocessing: Word Segmentation |
Capturing Paradigmatic Relations via Word Clustering | In this table, the symbol “+” in the Features column means the current configuration contains both the baseline features and new cluster-based features; the number is the total number of the clusters; the symbol “+” in the Data column means which portion of the Gigaword data is used to cluster words; the symbols “S” and “SS” in parentheses denote (s)upervised and (s)emi-(s)upervised word segmentation. |
State-of-the-Art | Penn Chinese Treebank (CTB) (Xue et al., 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and |