Index of papers in Proc. ACL 2012 that mention
  • word segmentation
Hatori, Jun and Matsuzaki, Takuya and Miyao, Yusuke and Tsujii, Jun'ichi
Abstract
We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese.
Introduction
spaces) between words, word segmentation is the crucial first step that is necessary to perform virtually all NLP tasks.
Introduction
Because the tasks of word segmentation and POS tagging have strong interactions, many studies have been devoted to the task of joint word segmentation and POS tagging for languages such as Chinese (e.g.
Introduction
The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than
Model
(2011), we build our joint model to solve word segmentation, POS tagging, and dependency parsing within a single framework.
Model
In our joint model, the early update is invoked by mistakes in any of word segmentation, POS tagging, or dependency parsing.
Related Works
In addition, the lattice does not include word segmentation ambiguities crossing boundaries of space-delimited tokens.
Related Works
However, because they regarded word segmentation as given, their model did not consider the
word segmentation is mentioned in 17 sentences in this paper.
Sun, Xu and Wang, Houfeng and Li, Wenjie
Abstract
We present a joint model for Chinese word segmentation and new word detection.
Abstract
As we know, training a word segmentation system on large-scale datasets is already costly.
Introduction
The major problem of Chinese word segmentation is the ambiguity.
Introduction
In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
Introduction
As we know, training a word segmentation system on large-scale datasets is already costly.
Related Work
First, we review related work on word segmentation and new word detection.
Related Work
2.1 Word Segmentation and New Word Detection
Related Work
Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010).
word segmentation is mentioned in 23 sentences in this paper.
Zhao, Qiuye and Marcus, Mitch
Abstract
We show for both English POS tagging and Chinese word segmentation that with proper representation, large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference.
Abstract
In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation.
Abstract
For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging.
word segmentation is mentioned in 15 sentences in this paper.
Sun, Weiwei and Wan, Xiaojun
About Heterogeneous Annotations
For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.
About Heterogeneous Annotations
Take Chinese word segmentation for example.
Abstract
We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.
Conclusion
Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
Experiments
Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments.
Introduction
This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks.
Introduction
In particular, joint word segmentation and POS tagging is addressed as a two step process.
Joint Chinese Word Segmentation and POS Tagging
words, word segmentation and POS tagging are important initial steps for Chinese language processing.
Joint Chinese Word Segmentation and POS Tagging
Two kinds of approaches are popular for joint word segmentation and POS tagging.
word segmentation is mentioned in 9 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Conclusion
TESLA-CELAB does not have a segmentation step, hence it will not introduce word segmentation errors.
Discussion and Future Work
Chinese word segmentation.
Experiments
We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonym
Experiments
Note also that the word segmentations shown in these examples are for clarity only.
Introduction
The most obvious challenge for Chinese is that of word segmentation.
Introduction
However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging.
Introduction
The only prior work attempting to address the problem of word segmentation in automatic MT evaluation for Chinese that we are aware of is Li et
Motivation
Character-based metrics do not suffer from errors and differences in word segmentation, so [garbled Chinese string] and [garbled Chinese string] would be judged exactly equal.
word segmentation is mentioned in 8 sentences in this paper.
Elsner, Micha and Goldwater, Sharon and Eisenstein, Jacob
Experiments
This version, which contains 9790 utterances (33399 tokens, 1321 types), is now standard for word segmentation, but contains no phonetic variability.
Experiments
As a simple extension of our model to the case of unknown word boundaries, we interleave it with an existing model of word segmentation, dpseg (Gold-
Introduction
For example, many models of word segmentation implicitly or explicitly build a lexicon while segmenting the input stream of phonemes into word tokens; in nearly all cases the phonemic input is created from an orthographic transcription using a phonemic dictionary, thus abstracting away from any phonetic variability (Brent, 1999; Venkataraman, 2001; Swingley, 2005; Goldwater et al., 2009, among others).
Related work
A final line of related work is on word segmentation .
Related work
In addition to the models mentioned in Section 1, which use phonemic input, a few models of word segmentation have been tested using phonetic input (Fleck, 2008; Rytting, 2007; Daland and Pierrehumbert, 2010).
word segmentation is mentioned in 5 sentences in this paper.
Li, Zhenghua and Liu, Ting and Che, Wanxiang
Experiments and Analysis
CDT and CTB5/6 adopt different POS tag sets, and converting from one tag set to another is difficult (Niu et al., 2009).5 To overcome this problem, we use the People’s Daily corpus (PD),6 a large-scale corpus annotated with word segmentation and POS tags, to train a statistical POS tagger.
Experiments and Analysis
5 The word segmentation standards of the two treebanks also slightly differs, which are not considered in this work.
Experiments and Analysis
Moreover, inferior results may be gained due to the differences between CTB5 and PD in word segmentation standards and text sources.
Related Work
(2009) improve the performance of word segmentation and part-of-speech (POS) tagging on CTB5 using another large-scale corpus of different annotation standards (People’s Daily).
word segmentation is mentioned in 4 sentences in this paper.
Sun, Weiwei and Uszkoreit, Hans
Capturing Paradigmatic Relations via Word Clustering
3.1.3 Preprocessing: Word Segmentation
Capturing Paradigmatic Relations via Word Clustering
In this table, the symbol “+” in the Features column means current configuration contains both the baseline features and new cluster-based features; the number is the total number of the clusters; the symbol “+” in the Data column means which portion of the Gigaword data is used to cluster words; the symbol “S” and “SS” in parentheses denote (s)upervised and (s)emi-(s)upervised word segmentation.
State-of-the-Art
Penn Chinese Treebank (CTB) (Xue et al., 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and
word segmentation is mentioned in 3 sentences in this paper.