Index of papers in Proc. ACL that mention
  • word segmentation
Zhang, Meishan and Zhang, Yue and Che, Wanxiang and Liu, Ting
Abstract
Character-level information can benefit downstream applications by offering flexible granularities for word segmentation while improving word-level dependency parsing accuracies.
Character-Level Dependency Tree
We differentiate intra-word dependencies and inter-word dependencies by the arc type, so that our work can be compared with conventional word segmentation, POS-tagging and dependency parsing pipelines under a canonical segmentation standard.
Character-Level Dependency Tree
The character-level dependency trees hold to a specific word segmentation standard, but are not limited to it.
Character-Level Dependency Tree
A transition-based framework with global learning and beam search decoding (Zhang and Clark, 2011) has been applied to a number of natural language processing tasks, including word segmentation, POS-tagging and syntactic parsing (Zhang and Clark, 2010; Huang and Sagae, 2010; Bohnet and Nivre, 2012; Zhang et al., 2013).
Introduction
Chinese dependency trees were conventionally defined over words (Chang et al., 2009; Li et al., 2012), requiring word segmentation and POS-tagging as preprocessing steps.
Introduction
First, character-level trees circumvent the issue that no universal standard exists for Chinese word segmentation.
Introduction
In the well-known Chinese word segmentation bakeoff tasks, for example, different segmentation standards have been used by different data sets (Emerson, 2005).
word segmentation is mentioned in 24 sentences in this paper.
Zhang, Longkai and Li, Li and He, Zhengyan and Wang, Houfeng and Sun, Ni
Abstract
While no segmented corpus of micro-blogs is available to train a Chinese word segmentation model, existing Chinese word segmentation tools do not perform as well as on ordinary news texts.
Abstract
In this paper we present an effective yet simple approach to Chinese word segmentation of micro-blogs.
Experiment
We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff2 as the labeled data.
Experiment
The first two are both famous Chinese word segmentation tools: ICTCLAS3 and Stanford Chinese word segmenter4, which are widely used in NLP related to word segmentation.
Experiment
Stanford Chinese word segmenter is a CRF-based segmentation tool and its segmentation standard is chosen as the PKU standard, which is the same as ours.
INTRODUCTION
These new features of micro-blogs make the Chinese Word Segmentation (CWS) models trained on the source domain, such as news corpus, fail to perform equally well when transferred to texts from micro-blogs.
Our method
Chinese word segmentation problem might be treated as a character labeling problem which gives each character a label indicating its position in one word.
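The character-labeling formulation described in this excerpt can be made concrete with a small sketch. The BMES positional scheme and the helper names below are common conventions used for illustration, not code from the paper:

```python
# Minimal sketch of the character-labeling view of Chinese word
# segmentation: each character receives a positional label under the
# common BMES scheme (Begin, Middle, End, Single-character word).

def words_to_tags(words):
    """Convert a segmented sentence (list of words) to per-character tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(w) - 2))
            tags.append("E")
    return tags

def tags_to_words(chars, tags):
    """Recover words from a character string plus BMES tags."""
    words, cur = [], ""
    for c, t in zip(chars, tags):
        cur += c
        if t in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:  # tolerate a truncated tag sequence
        words.append(cur)
    return words

words = ["我们", "在", "北京"]  # "we", "at", "Beijing"
tags = words_to_tags(words)
assert tags == ["B", "E", "S", "B", "E"]
assert tags_to_words("".join(words), tags) == words
```

With this encoding, segmentation reduces to sequence labeling, so any character tagger (CRF, perceptron, neural) can be applied directly.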
Related Work
Recent studies show that character sequence labeling is an effective formulation of Chinese word segmentation (Low et al., 2005; Zhao et al., 2006a,b; Chen et al., 2006; Xue, 2003).
Related Work
On the other hand, unsupervised word segmentation (Peng and Schuurmans, 2001; Goldwater et al., 1998) takes advantage of the huge amount of raw text to solve Chinese word segmentation problems.
word segmentation is mentioned in 13 sentences in this paper.
Zhang, Meishan and Zhang, Yue and Che, Wanxiang and Liu, Ting
Character-based Chinese Parsing
To produce character-level trees for Chinese NLP tasks, we develop a character-based parsing model, which can jointly perform word segmentation, POS tagging and phrase-structure parsing.
Character-based Chinese Parsing
First, we split the original SHIFT action into SHIFT—SEPARATE (t) and SHIFT—APPEND, which jointly perform the word segmentation and POS tagging tasks.
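The effect of these two shift actions can be illustrated by replaying an action sequence: SHIFT-SEPARATE(t) starts a new word with POS tag t, while SHIFT-APPEND glues the next character onto the current word. The `replay` function below is an invented illustration of the action semantics, not the authors' parser:

```python
# Replay a sequence of segmentation/tagging shift actions over a
# character string. ("SEP", t) starts a new word tagged t; ("APP",)
# appends the next character to the current word. Illustrative only.

def replay(chars, actions):
    words = []  # list of [word, pos] pairs, built left to right
    i = 0
    for act in actions:
        if act[0] == "SEP":          # SHIFT-SEPARATE(t)
            words.append([chars[i], act[1]])
        else:                        # SHIFT-APPEND
            words[-1][0] += chars[i]
        i += 1
    return [(w, t) for w, t in words]

chars = "我们在北京"  # "we at Beijing"
actions = [("SEP", "PN"), ("APP",), ("SEP", "P"), ("SEP", "NR"), ("APP",)]
assert replay(chars, actions) == [("我们", "PN"), ("在", "P"), ("北京", "NR")]
```

One action is consumed per character, so an action sequence of length n fully determines both the segmentation and the POS tags of an n-character sentence.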
Character-based Chinese Parsing
The string features are used for word segmentation and POS tagging, and are adapted from a state-of-the-art joint segmentation and tagging model (Zhang and Clark, 2010).
Experiments
Since our model can jointly process word segmentation, POS tagging and phrase-structure parsing, we evaluate our model for the three tasks, respectively.
Experiments
For word segmentation and POS tagging, standard metrics of word precision, recall and F-score are used, where the tagging accuracy is the joint accuracy of word segmentation and POS tagging.
Introduction
With regard to task of parsing itself, an important advantage of the character-level syntax trees is that they allow word segmentation, part-of-speech (POS) tagging and parsing to be performed jointly, using an efficient CKY-style or shift-reduce algorithm.
Introduction
To analyze word structures in addition to phrase structures, our character-based parser naturally performs word segmentation, POS tagging and parsing jointly.
Introduction
We extend their shift-reduce framework, adding more transition actions for word segmentation and POS tagging, and defining novel features that capture character information.
Word Structures and Syntax Trees
They made use of this information to help joint word segmentation and POS tagging.
Word Structures and Syntax Trees
For leaf characters, we follow previous work on word segmentation (Xue, 2003; Ng and Low, 2004), and use “b” and “i” to indicate the beginning and non-beginning characters of a word, respectively.
word segmentation is mentioned in 24 sentences in this paper.
Jiang, Wenbin and Sun, Meng and Lü, Yajuan and Yang, Yating and Liu, Qun
Abstract
Structural information in web text provides natural annotations for NLP problems such as word segmentation and parsing.
Abstract
With Chinese word segmentation as a case study, experiments show that the segmenter enhanced with the Chinese wikipedia achieves significant improvement on a series of testing sets from different domains, even with a single classifier and local features.
Introduction
Problems related to information retrieval, machine translation and social computing need fast and accurate text processing, for example, word segmentation and parsing.
Introduction
Taking Chinese word segmentation for example, the state-of-the-art models (Xue and Shen, 2003; Ng and Low, 2004; Gao et al., 2005; Nakagawa and Uchimoto, 2007; Zhao and Kit, 2008; Jiang et al., 2009; Zhang and Clark, 2010; Sun, 2011b; Li, 2011) are usually trained on human-annotated corpora such as the Penn Chinese Treebank (CTB) (Xue et al., 2005), and perform quite well on corresponding test sets.
Introduction
(b) Knowledge for word segmentation
word segmentation is mentioned in 41 sentences in this paper.
Johnson, Mark and Christophe, Anne and Dupoux, Emmanuel and Demuth, Katherine
Abstract
Inspired by experimental psychological findings suggesting that function words play a special role in word learning, we make a simple modification to an Adaptor Grammar based Bayesian word segmentation model to allow it to learn sequences of monosyllabic “function words” at the beginnings and endings of collocations of (possibly multisyllabic) words.
Abstract
This modification improves unsupervised word segmentation on the standard Bernstein-Ratner (1987) corpus of child-directed English by more than 4% token f-score compared to a model identical except that it does not special-case “function words”, setting a new state-of-the-art of 92.4% token f-score.
Introduction
We do this by comparing two computational models of word segmentation which differ solely in the way that they model function words.
Introduction
(1996) and Brent (1999) our word segmentation models identify word boundaries from unsegmented sequences of phonemes corresponding to utterances, effectively performing unsupervised learning of a lexicon.
Introduction
a word segmentation model should segment this as ju want tu si ðə buk, which is the IPA representation of “you want to see the book”.
Word segmentation with Adaptor Grammars
Perhaps the simplest word segmentation model is the unigram model, where utterances are modeled as sequences of words, and where each word is a sequence of segments (Brent, 1999; Goldwater et al., 2009).
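Decoding under such a unigram model can be sketched with a short dynamic program: score each segmentation as the sum of independent word log-probabilities and take the best. The toy lexicon and the helper name below are invented for illustration:

```python
import math

# Sketch of Viterbi decoding under a unigram word segmentation model:
# best[i] holds the score and backpointer of the best segmentation of
# the first i characters. Unsegmentable strings fall back to one word.

def best_segmentation(s, word_logprob, max_len=5):
    best = [(-math.inf, 0) for _ in range(len(s) + 1)]
    best[0] = (0.0, 0)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            w = s[j:i]
            if w in word_logprob:
                score = best[j][0] + word_logprob[w]
                if score > best[i][0]:
                    best[i] = (score, j)
    # follow backpointers from the end of the string
    words, i = [], len(s)
    while i > 0:
        j = best[i][1]
        words.append(s[j:i])
        i = j
    return words[::-1]

lexicon = {w: math.log(1 / 6) for w in
           ["you", "want", "to", "see", "the", "book"]}
assert best_segmentation("youwanttoseethebook", lexicon) == \
    ["you", "want", "to", "see", "the", "book"]
```

The unsupervised models in these papers learn `word_logprob` from raw text rather than taking a fixed lexicon; the decoding step is the same.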
word segmentation is mentioned in 24 sentences in this paper.
Börschinger, Benjamin and Johnson, Mark and Demuth, Katherine
Abstract
Current computational models of unsupervised word segmentation usually assume idealized input that is devoid of these kinds of variation.
Abstract
We extend a nonparametric model of word segmentation by adding phonological rules that map from underlying forms to surface forms to produce a mathematically well-defined joint model as a first step towards handling variation and segmentation in a single model.
Abstract
We analyse how our model handles /t/-deletion on a large corpus of transcribed speech, and show that the joint model can perform word segmentation and recover underlying /t/s.
Background and related work
This permits us to develop a joint generative model for both word segmentation and variation which we plan to extend to handle more phenomena in future work.
Background and related work
They do not aim for a joint model that also handles word segmentation, however, and rather than training their model on an actual corpus, they evaluate on constructed lists of examples, mimicking frequencies of real data.
Experiments 4.1 The data
This allows us to investigate the strength of the statistical signal for the deletion rule without confounding it with the word segmentation performance, and to see how the different contextual settings uniform, right and left-right handle the data.
Experiments 4.1 The data
Table 5: /t/-recovery F-scores when performing joint word segmentation in the left-right setting, averaged over two runs (standard errors less than 2%).
Experiments 4.1 The data
Finally, we are also interested to learn how well we can do word segmentation and underlying /t/-recovery jointly.
Introduction
Computational models of word segmentation try to solve one of the first problems language learners have to face: breaking an unsegmented stream of sound segments into individual words.
The computational model
Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation.
The computational model
Bayesian word segmentation models try to compactly represent the observed data in terms of a small set of units (word types) and a short analysis (a small number of word tokens).
word segmentation is mentioned in 19 sentences in this paper.
Zhao, Qiuye and Marcus, Mitch
Abstract
We show for both English POS tagging and Chinese word segmentation that with proper representation, a large number of deterministic constraints can be learned from training examples, and these are useful in constraining probabilistic inference.
Abstract
In this work, we explore deterministic constraints for two fundamental NLP problems, English POS tagging and Chinese word segmentation.
Abstract
For Chinese word segmentation (CWS), which can be formulated as character tagging, analogous constraints can be learned with the same templates as English POS tagging.
word segmentation is mentioned in 15 sentences in this paper.
Sun, Xu and Wang, Houfeng and Li, Wenjie
Abstract
We present a joint model for Chinese word segmentation and new word detection.
Abstract
As we know, training a word segmentation system on large-scale datasets is already costly.
Introduction
The major problem of Chinese word segmentation is the ambiguity.
Introduction
In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
Introduction
As we know, training a word segmentation system on large-scale datasets is already costly.
Related Work
First, we review related work on word segmentation and new word detection.
Related Work
2.1 Word Segmentation and New Word Detection
Related Work
Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010).
word segmentation is mentioned in 23 sentences in this paper.
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Abstract
In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN).
Abstract
Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks.
Conventional Neural Network
Formally, in the Chinese word segmentation task, we have a character dictionary D. Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere.
Conventional Neural Network
In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses 4 tags to carry word boundary information.
Conventional Neural Network
(2013) modeled Chinese word segmentation as a series of
Introduction
Therefore, word segmentation is a preliminary and important pre-process for Chinese language processing.
Introduction
(2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance.
Introduction
We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005) which are commonly used for evaluation of Chinese word segmentation .
word segmentation is mentioned in 19 sentences in this paper.
Shen, Mo and Liu, Hongxiao and Kawahara, Daisuke and Kurohashi, Sadao
Abstract
The focus of recent studies on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words to characters.
Abstract
We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging.
Chinese Morphological Analysis with Character-level POS
Previous studies have shown that jointly processing word segmentation and POS tagging is preferable to pipeline processing, which can propagate errors (Nakagawa and Uchimoto, 2007; Kruengkrai et al., 2009).
Conclusion
A Cascaded Linear Model for Joint Chinese Word Segmentation and Part-of-speech Tagging.
Conclusion
Word Lattice Reranking for Chinese Word Segmentation
Evaluation
To evaluate our proposed method, we have conducted two sets of experiments on CTB5: word segmentation, and joint word segmentation and word-level POS tagging.
Evaluation
The results of the word segmentation experiment and the joint experiment of segmentation and POS tagging are shown in Table 5(a) and Table 5(b), respectively.
Evaluation
The results show that, while the differences between the baseline model and the proposed model in word segmentation accuracies are small, the proposed model achieves significant improvement in the experiment of joint segmentation and POS tagging.
Introduction
In recent years, the focus of research on Chinese word segmentation, part-of-speech (POS) tagging and parsing has been shifting from words toward characters.
Introduction
We propose a method that performs character-level POS tagging jointly with word segmentation and word-level POS tagging.
word segmentation is mentioned in 20 sentences in this paper.
Hatori, Jun and Matsuzaki, Takuya and Miyao, Yusuke and Tsujii, Jun'ichi
Abstract
We propose the first joint model for word segmentation, POS tagging, and dependency parsing for Chinese.
Introduction
spaces) between words, word segmentation is the crucial first step that is necessary to perform virtually all NLP tasks.
Introduction
Because the tasks of word segmentation and POS tagging have strong interactions, many studies have been devoted to the task of joint word segmentation and POS tagging for languages such as Chinese (e.g.
Introduction
The joint approach to word segmentation and POS tagging has been reported to improve word segmentation and POS tagging accuracies by more than
Model
(2011), we build our joint model to solve word segmentation, POS tagging, and dependency parsing within a single framework.
Model
In our joint model, the early update is invoked by mistakes in any of word segmentation, POS tagging, or dependency parsing.
Related Works
In addition, the lattice does not include word segmentation ambiguities crossing boundaries of space-delimited tokens.
Related Works
However, because they regarded word segmentation as given, their model did not consider the
word segmentation is mentioned in 17 sentences in this paper.
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
Abstract
In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.
Abstract
We confirmed that it significantly outperforms previous reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation.
Inference
To find the hidden word segmentation w of a string s = c1 ⋯ cN, which is equivalent to the vector of binary hidden variables z = z1 ⋯ zN, the simplest approach is to build a Gibbs sampler that randomly selects a character ci and draws a binary decision zi as to whether there is a word boundary, and then updates the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).
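The boundary-variable view in this excerpt can be sketched as a toy pointwise Gibbs sweep. The smoothed unigram score below is a stand-in for the paper's nested Pitman-Yor model, and the sweep omits the count decrements/increments a real sampler performs around each resampled site; everything here is an illustrative assumption:

```python
import random

def word_score(w, counts, alpha=0.1, vocab_size=1000):
    # CRP-style smoothed unigram probability; an illustrative stand-in
    # for the paper's nested Pitman-Yor base measure.
    total = sum(counts.values())
    return (counts.get(w, 0) + alpha / vocab_size) / (total + alpha)

def gibbs_sweep(chars, z, counts, rng=random):
    """One sweep of pointwise Gibbs sampling over boundary variables.

    z[i] == True means a word boundary before character i; z[0] and
    z[len(chars)] are fixed utterance boundaries. A full sampler would
    also update word counts as each z[i] is resampled.
    """
    n = len(chars)
    for i in range(1, n):
        left = max(j for j in range(i) if z[j])
        right = min(j for j in range(i + 1, n + 1) if z[j])
        p_split = (word_score(chars[left:i], counts)
                   * word_score(chars[i:right], counts))
        p_join = word_score(chars[left:right], counts)
        z[i] = rng.random() < p_split / (p_split + p_join)
    return z

rng = random.Random(0)
chars = "thedogthedog"
z = [True] + [False] * (len(chars) - 1) + [True]
gibbs_sweep(chars, z, {"the": 2, "dog": 2}, rng)
```

The blocked sampler proposed in the paper instead resamples the segmentation of a whole string at once with dynamic programming, which mixes much faster than this character-at-a-time scheme.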
Introduction
Asian languages such as Chinese and Japanese have no explicit word boundaries, thus word segmentation is a crucial first step when processing them.
Introduction
In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations.
Introduction
This maximizes the probability of word segmentation w given a string s:
Nested Pitman-Yor Language Model
If a lexicon is finite, we can use a uniform prior G0(w) = 1/|V| for every word w in lexicon V. However, with word segmentation every substring could be a word, thus the lexicon is not limited but will be countably infinite.
Nested Pitman-Yor Language Model
Building an accurate G0 is crucial for word segmentation, since it determines what the possible words will look like.
word segmentation is mentioned in 25 sentences in this paper.
Kruengkrai, Canasai and Uchimoto, Kiyotaka and Kazama, Jun'ichi and Wang, Yiou and Torisawa, Kentaro and Isahara, Hitoshi
Abstract
In this paper, we present a discriminative word-character hybrid model for joint Chinese word segmentation and POS tagging.
Background
In joint word segmentation and the POS tagging process, the task is to predict a path
Experiments
Previous studies on joint Chinese word segmentation and POS tagging have used Penn Chinese Treebank (CTB) (Xia et al., 2000) in experiments.
Experiments
We evaluated both word segmentation (Seg) and joint word segmentation and POS tagging (Seg & Tag).
Experiments
(2008a; 2008b) on CTB 5.0 and Zhang and Clark (2008) on CTB 4.0 since they reported the best performances on joint word segmentation and POS tagging using the training materials only derived from the corpora.
Introduction
In Chinese, word segmentation and part-of-speech (POS) tagging are indispensable steps for higher-level NLP tasks.
Introduction
Word segmentation and POS tagging results are required as inputs to other NLP tasks, such as phrase chunking, dependency parsing, and machine translation.
Introduction
Word segmentation and POS tagging in a joint process have received much attention in recent research and have shown improvements over a pipelined fashion (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
Policies for correct path selection
In our experiments, the optimal threshold value is selected by evaluating the performance of joint word segmentation and POS tagging on the development set.
Related work
Maximum entropy models are widely used for word segmentation and POS tagging tasks (Uchimoto et al., 2001; Ng and Low, 2004; Nakagawa, 2004; Nakagawa and Uchimoto, 2007) since they only need moderate training times while they provide reasonable performance.
word segmentation is mentioned in 14 sentences in this paper.
Jiang, Wenbin and Huang, Liang and Liu, Qun
Abstract
We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese.
Automatic Annotation Adaptation
Considering that word segmentation and Joint S&T can be conducted in the same character classification manner, we can design a unified standard adaptation framework for the two tasks, by taking the source classifier’s classification result as the guide information for the target classifier’s classification decision.
Introduction
Figure 1: Incompatible word segmentation and POS tagging standards between CTB (upper) and People’s Daily (below).
Introduction
To test the efficacy of our method we choose Chinese word segmentation and part-of-speech tagging, where the problem of incompatible annotation standards is one of the most evident: so far no segmentation standard is widely accepted due to the lack of a clear definition of Chinese words, and the (almost complete) lack of morphology results in much bigger ambiguities and heavy debates in tagging philosophies for Chinese parts-of-speech.
Introduction
In addition, the improved accuracies from segmentation and tagging also lead to an improved parsing accuracy on CTB, reducing 38% of the error propagation from word segmentation to parsing.
Segmentation and Tagging as Character Classification
c1 c2 .. cn, where ci is a character, word segmentation aims to split the sequence into m (≤ n) words: c1:e1 ce1+1:e2 .. cem−1+1:em
Segmentation and Tagging as Character Classification
Xue and Shen (2003) describe for the first time the character classification approach for Chinese word segmentation, where each character is given a boundary tag denoting its relative position in a word.
Segmentation and Tagging as Character Classification
In addition, Ng and Low (2004) find that, compared with POS tagging after word segmentation , Joint S&T can achieve higher accuracy on both segmentation and POS tagging.
word segmentation is mentioned in 27 sentences in this paper.
Johnson, Mark
Introduction
We show that simultaneously learning syllable structure and collocations improves word segmentation accuracy compared to models that learn these independently.
Introduction
This paper applies adaptor grammars to word segmentation and morphological acquisition.
Word segmentation with adaptor grammars
We now turn to linguistic applications of adaptor grammars, specifically, to models of unsupervised word segmentation .
Word segmentation with adaptor grammars
Table 1: Word segmentation f-score results for all models, as a function of DP concentration parameter α.
Word segmentation with adaptor grammars
Table 1 summarizes the word segmentation f-scores for all models described in this paper.
word segmentation is mentioned in 23 sentences in this paper.
Sun, Weiwei and Wan, Xiaojun
About Heterogeneous Annotations
For Chinese word segmentation and POS tagging, supervised learning has become a dominant paradigm.
About Heterogeneous Annotations
Take Chinese word segmentation for example.
Abstract
We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging.
Conclusion
Our theoretical and empirical analysis of two representative popular corpora highlights two essential characteristics of heterogeneous annotations which are explored to reduce approximation and estimation errors for Chinese word segmentation and POS tagging.
Experiments
Previous studies on joint Chinese word segmentation and POS tagging have used the CTB in experiments.
Introduction
This paper explores heterogeneous annotations to reduce both approximation and estimation errors for Chinese word segmentation and part-of-speech (POS) tagging, which are fundamental steps for more advanced Chinese language processing tasks.
Introduction
In particular, joint word segmentation and POS tagging is addressed as a two step process.
Joint Chinese Word Segmentation and POS Tagging
words, word segmentation and POS tagging are important initial steps for Chinese language processing.
Joint Chinese Word Segmentation and POS Tagging
Two kinds of approaches are popular for joint word segmentation and POS tagging.
word segmentation is mentioned in 9 sentences in this paper.
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang
Abstract
This study investigates building a better Chinese word segmentation model for statistical machine translation.
Experiments
The influence of the word segmentation on the final translation is our main investigation.
Experiments
Firstly, as expected, having word segmentation does help Chinese-to-English MT.
Experiments
This section aims to further analyze the three primary observations concluded in Section 4.3: i) word segmentation is useful to SMT; ii) the treebank and the bilingual segmentation knowledge are helpful, performing segmentation of different nature; and iii) the bilingual constraints lead to learn segmentations better tailored for SMT.
Introduction
Word segmentation is regarded as a critical procedure for high-level Chinese language processing tasks, since Chinese scripts are written in continuous characters without explicit word boundaries (e.g., space in English).
Introduction
The empirical works show that word segmentation can be beneficial to Chinese-to-English statistical machine translation (SMT) (Xu et al., 2005; Chang et al., 2008; Zhao et al., 2013).
Introduction
The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on the hand-annotated treebank data, e.g., Chinese treebank
word segmentation is mentioned in 9 sentences in this paper.
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
This paper introduces a graph-based semi-supervised joint model of Chinese word segmentation and part-of-speech tagging.
Introduction
Word segmentation and part-of-speech (POS) tagging are two critical and necessary initial procedures with respect to the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation.
Introduction
The joint approaches of word segmentation and POS tagging (joint S&T) are proposed to resolve these two tasks simultaneously.
Introduction
As far as we know, however, these methods have not yet been applied to resolve the problem of joint Chinese word segmentation (CWS) and POS tagging.
Method
The performance measurement indicators for word segmentation and POS tagging (joint S&T) are balanced F-score, F = 2PR/(P+R), the harmonic mean of precision (P) and recall (R), and out-of-vocabulary recall (OOV-R).
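These word-level metrics can be computed by comparing the character spans of predicted and gold words; a word counts as correct only if both its boundaries match. The helper names below are illustrative, not from the paper:

```python
# Word-level precision/recall/F-score for segmentation: convert each
# segmentation into a set of (start, end) character spans and count
# exact span matches.

def word_spans(words):
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def prf(gold_words, pred_words):
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# gold: 我们 | 在 | 北京    predicted: 我 | 们 | 在 | 北京
p, r, f = prf(["我们", "在", "北京"], ["我", "们", "在", "北京"])
assert p == 0.5 and abs(r - 2 / 3) < 1e-12
```

OOV recall is computed the same way, restricted to gold words absent from the training lexicon.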
Method
This outcome verifies the commonly accepted fact that the joint model can substantially improve the pipeline one, since POS tags provide additional information to word segmentation (Ng and Low, 2004).
Method
Overall, for word segmentation, it obtains average improvements of 1.43% and 8.09% in F-score and OOV-R over others; for POS tagging, it achieves average improvements of 1.09% and 7.73%.
word segmentation is mentioned in 9 sentences in this paper.
Liu, Chang and Ng, Hwee Tou
Conclusion
TESLA-CELAB does not have a segmentation step, hence it will not introduce word segmentation errors.
Discussion and Future Work
Chinese word segmentation.
Experiments
We use the Stanford Chinese word segmenter (Tseng et al., 2005) and POS tagger (Toutanova et al., 2003) for preprocessing and Cilin for synonym
Experiments
Note also that the word segmentations shown in these examples are for clarity only.
Introduction
The most obvious challenge for Chinese is that of word segmentation.
Introduction
However, many different segmentation standards exist for different purposes, such as Microsoft Research Asia (MSRA) for Named Entity Recognition (NER), Chinese Treebank (CTB) for parsing and part-of-speech (POS) tagging, and City University of Hong Kong (CITYU) and Academia Sinica (AS) for general word segmentation and POS tagging.
Introduction
The only prior work attempting to address the problem of word segmentation in automatic MT evaluation for Chinese that we are aware of is Li et
Motivation
Character-based metrics do not suffer from errors and differences in word segmentation, so the two variant forms would be judged exactly equal.
word segmentation is mentioned in 8 sentences in this paper.
Huang, Minlie and Ye, Borui and Wang, Yichen and Chen, Haiqiang and Cheng, Junjun and Zhu, Xiaoyan
Abstract
Automatic extraction of new words is an indispensable precursor to many NLP tasks such as Chinese word segmentation, named entity extraction, and sentiment analysis.
Experiment
The posts were then part-of-speech tagged using a Chinese word segmentation tool named ICTCLAS (Zhang et al., 2003).
Introduction
Automatic extraction of new words is indispensable to many tasks such as Chinese word segmentation, machine translation, named entity extraction, question answering, and sentiment analysis.
Introduction
New word detection is one of the most critical issues in Chinese word segmentation.
Introduction
Recent studies (Sproat and Emerson, 2003) (Chen, 2003) have shown that more than 60% of word segmentation errors result from new words.
Methodology
Obviously, in order to obtain the value of 3(wi), some particular Chinese word segmentation tool is required.
Related Work
New word detection has been usually interweaved with word segmentation , particularly in Chinese NLP.
Related Work
In these works, new word detection is considered as an integral part of segmentation, where new words are identified as the most probable segments inferred by the probabilistic models; and the detected new word can be further used to improve word segmentation .
word segmentation is mentioned in 8 sentences in this paper.
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Abstract
Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment.
Complexity Analysis
The proposed method does not require any annotated data, but the SMT system with it can achieve comparable performance compared to state-of-the-art supervised word segmenters trained on precious annotated data.
Complexity Analysis
Moreover, the proposed method yields 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus.
Complexity Analysis
Thus, we believe that the proposed method would benefit SMT related to low-resource languages where annotated data are scarce, and would also find application in domains that differ too greatly from the domains on which supervised word segmenters were trained.
Introduction
Many languages, especially Asian languages such as Chinese, Japanese and Myanmar, have no explicit word boundaries, thus word segmentation (WS), that is, segmenting the continuous texts of these languages into isolated words, is a prerequisite for many natural language processing applications including SMT.
Introduction
Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
word segmentation is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Jia, Zhongye and Zhao, Hai
Experiments
Maximum matching word segmentation is used with a large word vocabulary V extracted from web data provided by Wang et al. (2013b).
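The greedy forward maximum matching scheme mentioned above can be sketched in a few lines; the toy vocabulary and maximum word length below are illustrative stand-ins, not the web-derived vocabulary V used in the paper:

```python
# Forward maximum matching: at each position, take the longest
# vocabulary entry starting there; fall back to a single character
# if no entry matches.
def max_match(text, vocab, max_len=6):
    i, out = 0, []
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

vocab = {"ab", "abc", "cd"}
print(max_match("abcd", vocab))  # ['abc', 'd']
```

Note the characteristic failure mode: the greedy choice of "abc" forces the out-of-vocabulary singleton "d", even though "ab" + "cd" covers the string with known words.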
Pinyin Input Method Model
Without word delimiters, linguists have long argued about what a Chinese word really is, which is why there is always a primary word segmentation treatment in most Chinese language processing tasks (Zhao et al., 2006; Huang and Zhao, 2007; Zhao and Kit, 2008; Zhao et al., 2010; Zhao and Kit, 2011; Zhao et al., 2013).
Pinyin Input Method Model
A Chinese word may contain from 1 to over 10 characters due to different word segmentation conventions.
Pinyin Input Method Model
Nevertheless, pinyin syllable segmentation is a much easier problem compared to Chinese word segmentation.
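One reason pinyin syllable segmentation is easier: the legal syllable inventory is small and closed, so all candidate splits of a letter string can be enumerated directly. The sketch below assumes a tiny three-syllable toy inventory, not the full Mandarin inventory of roughly 400 syllables:

```python
# Enumerate all ways to split a pinyin letter string into legal
# syllables. SYLLABLES is a toy subset for illustration only.
SYLLABLES = {"xi", "an", "xian"}

def syllable_splits(s, start=0):
    if start == len(s):
        return [[]]  # one way to split the empty remainder
    splits = []
    for end in range(start + 1, len(s) + 1):
        if s[start:end] in SYLLABLES:
            for rest in syllable_splits(s, end):
                splits.append([s[start:end]] + rest)
    return splits

print(syllable_splits("xian"))  # [['xi', 'an'], ['xian']]
```

The ambiguity shown (xi'an vs. xian) is about the only kind that arises; a language model over syllables can then pick the intended split.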
word segmentation is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wang, Aobo and Kan, Min-Yen
Abstract
We exploit this reliance as an opportunity: recognizing the relation between informal word recognition and Chinese word segmentation , we propose to model the two tasks jointly.
Conclusion
There is a close dependency between Chinese word segmentation (CWS) and informal word recognition (IWR).
Introduction
This example illustrates the mutual dependency between Chinese word segmentation (henceforth, CWS) and informal word recognition (IWR) that should be solved jointly.
Methodology
Given an input Chinese microblog post, our method simultaneously segments the sentences into words (the Chinese Word Segmentation, CWS, task), and marks the component words as informal or formal ones (the Informal Word Recognition, IWR, task).
Methodology
Character-based sequence labeling is employed for word segmentation due to its simplicity and robustness to the unknown word problem (Xue, 2003).
Related Work
Closely related to our work is the task of Chinese new word detection, normally treated as a separate process from word segmentation in most previous works (Chen and Bai, 1998; Wu and Jiang, 2000; Chen and Ma, 2002; Gao et al., 2005).
word segmentation is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Yanping and Zheng, Qinghua and Zhang, Wei
Abstract
The Omni-word feature uses every potential word in a sentence as a lexicon feature, reducing errors caused by word segmentation.
Abstract
Both the Omni-word feature and the soft constraint make better use of sentence information and minimize the influence of Chinese word segmentation and parsing.
Feature Construction
On the other hand, the Omni-word can avoid these problems and take advantage of Chinese characteristics (the word-formation and the ambiguity of word segmentation).
Introduction
The lack of orthographic words makes Chinese word segmentation difficult.
Related Work
(2008; 2010) also pointed out that, due to the inaccuracy of Chinese word segmentation and parsing, the tree kernel based approach is inappropriate for Chinese relation extraction.
word segmentation is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Fleck, Margaret M.
A single character is used if no suffix occurs 10 times.
In a full understanding system, output of the word segmenter would be passed to morphological and local syntactic processing.
A single character is used if no suffix occurs 10 times.
Because standard models of morphological learning don’t address the interaction with word segmentation, WordEnds does a simple version of this repair process using a placeholder algorithm called Mini-morph.
Previous work
Word segmentation experiments by Christiansen and Allen (1997) and Harrington et al.
The task in more detail
The datasets are informal conversations in which debatable word segmentations are rare.
The task in more detail
A theory of word segmentation must explain how affixes differ from freestanding function words.
word segmentation is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
This paper presents a semi-supervised Chinese word segmentation (CWS) approach that co-regularizes character-based and word-based models.
Experiment
Table 2 shows the F-score results of word segmentation on CTB-5, CTB-6 and CTB-7 testing sets.
Experiment
It is a supervised joint model of word segmentation, POS tagging and dependency parsing.
Introduction
Chinese word segmentation (CWS) is a critical and necessary initial procedure for the majority of high-level Chinese language processing tasks such as syntax parsing, information extraction and machine translation, since Chinese scripts are written in continuous characters without explicit word boundaries.
Segmentation Models
Character-based models treat word segmentation as a sequence labeling problem, assigning labels to the characters in a sentence indicating their positions in a word.
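The character-based view described above can be made concrete with the widely used BMES positional label scheme (Begin, Middle, End of a multi-character word, or Single-character word); this is a common convention and the paper's exact label set may differ. A minimal sketch of the conversion in both directions:

```python
# Convert a gold word sequence into per-character BMES labels,
# and decode a label sequence back into words.
def words_to_labels(words):
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels += ["B"] + ["M"] * (len(w) - 2) + ["E"]
    return labels

def labels_to_words(chars, labels):
    words, cur = [], ""
    for c, t in zip(chars, labels):
        cur += c
        if t in ("E", "S"):  # word boundary after this character
            words.append(cur)
            cur = ""
    return words

print(words_to_labels(["我", "喜欢", "自然语言"]))
# ['S', 'B', 'E', 'B', 'M', 'M', 'E']
```

With this encoding, any sequence labeler (e.g., a CRF) trained to predict one of four tags per character performs segmentation, and unknown words pose no special problem since every character always receives some tag.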
word segmentation is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Elsner, Micha and Goldwater, Sharon and Eisenstein, Jacob
Experiments
This version, which contains 9790 utterances (33399 tokens, 1321 types), is now standard for word segmentation , but contains no phonetic variability.
Experiments
As a simple extension of our model to the case of unknown word boundaries, we interleave it with an existing model of word segmentation , olpseg (Gold-
Introduction
For example, many models of word segmentation implicitly or explicitly build a lexicon while segmenting the input stream of phonemes into word tokens; in nearly all cases the phonemic input is created from an orthographic transcription using a phonemic dictionary, thus abstracting away from any phonetic variability (Brent, 1999; Venkataraman, 2001; Swingley, 2005; Goldwater et al., 2009, among others).
Related work
A final line of related work is on word segmentation.
Related work
In addition to the models mentioned in Section 1, which use phonemic input, a few models of word segmentation have been tested using phonetic input (Fleck, 2008; Rytting, 2007; Daland and Pierrehumbert, 2010).
word segmentation is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Zhenghua and Liu, Ting and Che, Wanxiang
Experiments and Analysis
CDT and CTB5/6 adopt different POS tag sets, and converting from one tag set to another is difficult (Niu et al., 2009).5 To overcome this problem, we use the People’s Daily corpus (PD),6 a large-scale corpus annotated with word segmentation and POS tags, to train a statistical POS tagger.
Experiments and Analysis
5 The word segmentation standards of the two treebanks also differ slightly, which is not considered in this work.
Experiments and Analysis
Moreover, inferior results may be gained due to the differences between CTB5 and PD in word segmentation standards and text sources.
Related Work
(2009) improve the performance of word segmentation and part-of-speech (POS) tagging on CTB5 using another large-scale corpus with different annotation standards (People’s Daily).
word segmentation is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhao, Hai and Song, Yan and Kit, Chunyu and Zhou, Guodong
Exploiting the Translated Treebank
However, Chinese has a special primary processing task, i.e., word segmentation .
Exploiting the Translated Treebank
Note that CTB or any other Chinese treebank has its own word segmentation guideline.
Exploiting the Translated Treebank
The English treebank is translated into Chinese word by word, so the Chinese words in the translated text are exactly entries from the bilingual lexicon; they are actually irregular phrases, short sentences or something else rather than words that follow any existing word segmentation convention.
word segmentation is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Wang, Zhigang and Li, Zhixing and Li, Juanzi and Tang, Jie and Z. Pan, Jeff
Experiments
The first category is caused by incorrect word segmentation (40.85%).
Experiments
The result of word segmentation directly determines the performance of extraction, so it causes most of the errors.
Experiments
In the future, we can improve the performance of WikiCiKE by polishing the word segmentation result.
word segmentation is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hagiwara, Masato and Sekine, Satoshi
Abstract
Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation. Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use.
Introduction
Accurate word segmentation (WS) is a key component of successful language processing.
Introduction
For example, when splitting a compound noun ブラッキッシュレッド burakkisshureddo, a traditional word segmenter can easily segment this as ブラッキ/シュレッド “*blacki shred” since シュレッド shureddo “shred” is a known, frequent word.
word segmentation is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Chen, Ruey-Cheng
Abstract
We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression.
Concluding Remarks
A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010).
Introduction
Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the dawn of hierarchical Dirichlet process (Goldwater et al., 2009) and adaptors grammar (Johnson and Goldwater, 2009).
word segmentation is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Uszkoreit, Hans
Capturing Paradigmatic Relations via Word Clustering
3.1.3 Preprocessing: Word Segmentation
Capturing Paradigmatic Relations via Word Clustering
In this table, the symbol “+” in the Features column means that the current configuration contains both the baseline features and the new cluster-based features; the number is the total number of clusters; the symbol “+” in the Data column indicates which portion of the Gigaword data is used to cluster words; and the symbols “S” and “SS” in parentheses denote (s)upervised and (s)emi-(s)upervised word segmentation.
State-of-the-Art
Penn Chinese Treebank (CTB) (Xue et al., 2005) is a popular data set to evaluate a number of Chinese NLP tasks, including word segmentation (Sun and
word segmentation is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Adler, Meni and Goldberg, Yoav and Gabay, David and Elhadad, Michael
Previous Work
Nakagawa (2004) combines word-level and character-level information for Chinese and Japanese word segmentation.
Previous Work
(of all words in a given sentence) and the POS tagging (of the known words) is based on a Viterbi search over a lattice composed of all possible word segmentations and the possible classifications of all observed characters.
Previous Work
Their experimental results show that the method achieves high accuracy over state-of-the-art methods for Chinese and Japanese word segmentation .
word segmentation is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: