Index of papers in Proc. ACL 2013 that mention
  • word segmentation
Börschinger, Benjamin and Johnson, Mark and Demuth, Katherine
Abstract
Current computational models of unsupervised word segmentation usually assume idealized input that is devoid of such phonological variation.
Abstract
We extend a nonparametric model of word segmentation by adding phonological rules that map from underlying forms to surface forms to produce a mathematically well-defined joint model as a first step towards handling variation and segmentation in a single model.
Abstract
We analyse how our model handles /t/-deletion on a large corpus of transcribed speech, and show that the joint model can perform word segmentation and recover underlying /t/s.
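As a rough illustration of the underlying-to-surface mapping the abstract describes, word-final /t/-deletion can be written as a context-dependent rule that fires with some probability. This is a schematic sketch with assumed notation, not the paper's exact parameterization; the contexts correspond to the uniform, right and left-right settings discussed below.

    P(s \mid u, c) =
    \begin{cases}
      \rho_c     & \text{if } u \text{ ends in /t/ and } s \text{ is } u \text{ with the final /t/ deleted}\\
      1 - \rho_c & \text{if } u \text{ ends in /t/ and } s = u\\
      1          & \text{if } u \text{ does not end in /t/ and } s = u
    \end{cases}

Here u is an underlying word form, s its surface realization, and \rho_c a deletion probability that may depend on the phonological context c.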
Background and related work
This permits us to develop a joint generative model for both word segmentation and variation, which we plan to extend to handle more phenomena in future work.
Background and related work
They do not aim for a joint model that also handles word segmentation, however, and rather than training their model on an actual corpus, they evaluate on constructed lists of examples, mimicking frequencies of real data.
Experiments 4.1 The data
This allows us to investigate the strength of the statistical signal for the deletion rule without confounding it with the word segmentation performance, and to see how the different contextual settings (uniform, right and left-right) handle the data.
Experiments 4.1 The data
Table 5: /t/-recovery F-scores when performing joint word segmentation in the left-right setting, averaged over two runs (standard errors less than 2%).
Experiments 4.1 The data
Finally, we are also interested to learn how well we can do word segmentation and underlying /t/-recovery jointly.
Introduction
Computational models of word segmentation try to solve one of the first problems language learners have to face: breaking an unsegmented stream of sound segments into individual words.
The computational model
Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation.
The computational model
Bayesian word segmentation models try to compactly represent the observed data in terms of a small set of units (word types) and a short analysis (a small number of word tokens).
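The intuition of a small lexicon reused often can be made concrete with the Dirichlet-process predictive distribution that underlies Goldwater-style segmenters. This is a generic sketch in its unigram form; the model in this paper builds on the Bigram variant.

    P(w_i = \ell \mid w_{1:i-1}) = \frac{n_\ell + \alpha P_0(\ell)}{i - 1 + \alpha}

Here n_\ell counts how often lexical item \ell has already been generated, P_0 is a base distribution over possible word forms, and \alpha controls how readily new types are introduced: reusing frequent types is cheap, while new types pay the cost of P_0.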
word segmentation is mentioned in 19 sentences in this paper.
Jiang, Wenbin and Sun, Meng and Lü, Yajuan and Yang, Yating and Liu, Qun
Abstract
Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing.
Abstract
With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese Wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features.
Introduction
Problems related to information retrieval, machine translation and social computing need fast and accurate text processing, for example, word segmentation and parsing.
Introduction
Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets.
Introduction
(b) Knowledge for word segmentation
word segmentation is mentioned in 41 sentences in this paper.
Zhang, Meishan and Zhang, Yue and Che, Wanxiang and Liu, Ting
Character-based Chinese Parsing
To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model, which can jointly perform word segmentation, POS tagging and phrase-structure parsing.
Character-based Chinese Parsing
First, we split the original SHIFT action into SHIFT-SEPARATE(t) and SHIFT-APPEND, which jointly perform the word segmentation and POS tagging tasks.
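As a rough sketch of what these two actions do to a partial analysis (illustrative only; the helper names and data structures below are hypothetical and do not reflect the authors' parser, which also builds syntactic structure and uses beam search):

    # Sketch of character-level shift actions for joint segmentation and tagging.
    # Hypothetical minimal data structures; not the paper's implementation.

    def shift_separate(stack, buffer, tag):
        """Start a new word with the next character and assign it POS tag `tag`."""
        stack.append({"word": buffer.pop(0), "tag": tag})

    def shift_append(stack, buffer):
        """Append the next character to the word currently being built."""
        stack[-1]["word"] += buffer.pop(0)

    # Example: segmenting a four-character buffer into two two-character words
    # (the POS tags are illustrative CTB-style tags).
    stack, buffer = [], list("中国经济")
    shift_separate(stack, buffer, "NR")   # start word "中"
    shift_append(stack, buffer)           # extend it to "中国"
    shift_separate(stack, buffer, "NN")   # start word "经"
    shift_append(stack, buffer)           # extend it to "经济"
    print(stack)  # [{'word': '中国', 'tag': 'NR'}, {'word': '经济', 'tag': 'NN'}]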
Character-based Chinese Parsing
The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010).
Experiments
Since our model can jointly process word segmentation, POS tagging and phrase-structure parsing, we evaluate our model for the three tasks, respectively.
Experiments
For word segmentation and POS tagging, standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging.
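These word-level scores are usually computed by matching character spans between the gold and predicted segmentations; the following is a minimal sketch of that convention (an assumption, since the paper does not spell out its evaluation script):

    # Word precision/recall/F-score via matching character spans.
    # A common evaluation convention; not necessarily the exact script used here.

    def to_spans(words):
        """Turn a word list into a set of (start, end) character spans."""
        spans, pos = set(), 0
        for w in words:
            spans.add((pos, pos + len(w)))
            pos += len(w)
        return spans

    def prf(gold_words, pred_words):
        gold, pred = to_spans(gold_words), to_spans(pred_words)
        correct = len(gold & pred)
        p = correct / len(pred) if pred else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    print(prf(["中国", "经济", "发展"], ["中国", "经", "济", "发展"]))
    # approximately (0.5, 0.667, 0.571)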
Introduction
With regard to the task of parsing itself, an important advantage of the character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm.
Introduction
To analyze word structures in addition to phrase structures, our character-based parser naturally performs word segmentation, POS tagging and parsing jointly.
Introduction
We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information.
Word Structures and Syntax Trees
They made use of this information to help joint word segmentation and POS tagging.
Word Structures and Syntax Trees
For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use “b” and “i” to indicate the beginning and non-beginning characters of a word, respectively.
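For instance, the b/i scheme turns a segmented sentence into one label per character; a minimal sketch:

    # Map segmented words to per-character labels:
    # "b" = word-initial character, "i" = any other character of the word.

    def words_to_bi(words):
        labels = []
        for w in words:
            labels.extend(["b"] + ["i"] * (len(w) - 1))
        return labels

    print(words_to_bi(["中国", "经济", "发展"]))
    # ['b', 'i', 'b', 'i', 'b', 'i']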
word segmentation is mentioned in 24 sentences in this paper.
Zhang, Longkai and Li, Li and He, Zhengyan and Wang, Houfeng and Sun, Ni
Abstract
While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools cannot perform as well on micro-blog text as they do on ordinary news text.
Abstract
In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blog texts.
Experiment
We use the benchmark datasets provided by the Second International Chinese Word Segmentation Bakeoff as the labeled data.
Experiment
The first two are both well-known Chinese word segmentation tools: ICTCLAS and the Stanford Chinese word segmenter, which are widely used in NLP tasks related to word segmentation.
Experiment
The Stanford Chinese word segmenter is a CRF-based segmentation tool, and its segmentation standard is chosen as the PKU standard, which is the same as ours.
INTRODUCTION
These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs.
Our method
The Chinese word segmentation problem can be treated as a character labeling problem in which each character is given a label indicating its position within a word.
Related Work
Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
Related Work
On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
word segmentation is mentioned in 13 sentences in this paper.
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging.
Introduction
Word segmentation and part-of-speech (POS) tagging are two critical and necessary initial procedures with respect to the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation.
Introduction
The joint approaches of word segmentation and POS tagging (joint S&T) are proposed to resolve these two tasks simultaneously.
Introduction
As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging.
Method
The performance measurement indicators for word segmentation and POS tagging (joint S&T) are the balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
Method
This outcome verifies the commonly accepted fact that the joint model can substantially improve the pipeline one, since POS tags provide additional information to word segmentation (Ng and Low, 2004).
Method
Overall, for word segmentation, it obtains average improvements of 1.43% and 8.09% in F-score and OOV-R over others; for POS tagging, it achieves average improvements of 1.09% and 7.73%.
word segmentation is mentioned in 9 sentences in this paper.
Wang, Aobo and Kan, Min-Yen
Abstract
We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation, we propose to model the two tasks jointly.
Conclusion
There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).
Introduction
This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly.
Methodology
Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
Methodology
Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003).
Related Work
Closely related to our work is the task of Chinese new word detection, normally treated as a separate process from word segmentation in most previous works (Chen and Bai, 1998; Wu and Jiang, 2000; Chen and Ma, 2002; Gao et al., 2005).
word segmentation is mentioned in 6 sentences in this paper.
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models.
Experiment
Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets.
Experiment
It is a supervised joint model of word segmentation, POS tagging and dependency parsing.
Introduction
Chinese word segmentation (CWS) is a critical and necessary initial procedure with respect to the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation, since Chinese scripts are written in continuous characters without explicit word boundaries.
Segmentation Models
Character-based models treat word segmentation as a sequence labeling problem, assigning labels to the characters in a sentence indicating their positions in a word.
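Decoding goes the other way: a predicted label sequence is scanned once to recover words. The sketch below uses a two-label b/i scheme for simplicity; character-based segmenters often use richer tag sets such as B/M/E/S, and the paper's exact tag set is an assumption here.

    # Recover words from per-character labels ("b" starts a word, "i" continues it).
    # Illustrative two-label scheme; the paper's tag set may differ.

    def bi_to_words(chars, labels):
        words = []
        for ch, lab in zip(chars, labels):
            if lab == "b" or not words:
                words.append(ch)       # start a new word
            else:
                words[-1] += ch        # extend the current word
        return words

    print(bi_to_words("中国经济发展", ["b", "i", "b", "i", "b", "i"]))
    # ['中国', '经济', '发展']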
word segmentation is mentioned in 5 sentences in this paper.
Chen, Ruey-Cheng
Abstract
We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression.
Concluding Remarks
A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010).
Introduction
Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the dawn of the hierarchical Dirichlet process (Goldwater et al., 2009) and adaptor grammars (Johnson and Goldwater, 2009).
word segmentation is mentioned in 3 sentences in this paper.
Hagiwara, Masato and Sekine, Satoshi
Abstract
Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Off-line approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use.
Introduction
Accurate word segmentation (WS) is a key component of successful language processing.
Introduction
For example, when splitting a compound noun ブラキッシュレッド burakisshureddo, a traditional word segmenter can easily segment this as ブラキ/シュレッド "*blacki shred", since シュレッド shureddo "shred" is a known, frequent word.
word segmentation is mentioned in 3 sentences in this paper.
Wang, Zhigang and Li, Zhixing and Li, Juanzi and Tang, Jie and Z. Pan, Jeff
Experiments
The first category is caused by incorrect word segmentation (40.85%).
Experiments
The result of word segmentation directly determines the performance of extraction, so it causes most of the errors.
Experiments
In the future, we can improve the performance of WikiCiKE by polishing the word segmentation result.
word segmentation is mentioned in 3 sentences in this paper.