Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
Xu Sun, Houfeng Wang, and Wenjie Li

Article Structure

Abstract

We present a joint model for Chinese word segmentation and new word detection.

Introduction

Since Chinese sentences are written as continuous sequences of characters, segmenting a character sequence into words is normally the first step in the pipeline of Chinese text processing.

Related Work

First, we review related work on word segmentation and new word detection.

System Architecture

3.1 A Joint Model Based on CRFs

Topics

word segmentation

Appears in 23 sentences as: word segment (1) Word Segmentation (3) word segmentation (18) word segments (2)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. We present a joint model for Chinese word segmentation and new word detection.
    Page 1, “Abstract”
  2. As we know, training a word segmentation system on large-scale datasets is already costly.
    Page 1, “Abstract”
  3. The major problem of Chinese word segmentation is the ambiguity.
    Page 1, “Introduction”
  4. In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
    Page 1, “Introduction”
  5. As we know, training a word segmentation system on large-scale datasets is already costly.
    Page 1, “Introduction”
  6. To solve this challenging problem, we propose a new training method, adaptive online gradient descent based on feature frequency information (ADF), for very fast word segmentation with new word detection, even given large-scale datasets with high dimensional features.
    Page 1, “Introduction”
  7. • We propose a joint model for Chinese word segmentation and new word detection.
    Page 2, “Introduction”
  8. • Compared with prior work, our system achieves better accuracies on both word segmentation and new word detection.
    Page 2, “Introduction”
  9. First, we review related work on word segmentation and new word detection.
    Page 2, “Related Work”
  10. 2.1 Word Segmentation and New Word Detection
    Page 2, “Related Work”
  11. Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010).
    Page 2, “Related Work”
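
As the excerpts above indicate, segmentation is cast as sequential labeling over characters. For concreteness, here is a minimal sketch of how a B/I/E label sequence (the scheme the paper adopts, per the excerpts in the topics below) decodes into word segments; the function and its handling of single-character words are illustrative assumptions, not the paper's code:

```python
def labels_to_segments(chars, labels):
    """Decode a B/I/E label sequence into word segments.

    B = word-initial character, I = word-internal, E = word-final.
    How single-character words are labeled varies across papers; here a
    segment closes at E or whenever the next label starts a new word
    (an assumption, not necessarily the paper's exact convention).
    """
    segments, current = [], []
    for i, (ch, lab) in enumerate(zip(chars, labels)):
        current.append(ch)
        next_lab = labels[i + 1] if i + 1 < len(labels) else None
        if lab == "E" or next_lab in ("B", None):
            segments.append("".join(current))
            current = []
    return segments

# A four-character sequence labeled B E B E yields two two-character words.
print(labels_to_segments(list("abcd"), list("BEBE")))  # ['ab', 'cd']
```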


Chinese word

Appears in 10 sentences as: Chinese Word (2) Chinese word (8)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. We present a joint model for Chinese word segmentation and new word detection.
    Page 1, “Abstract”
  2. The major problem of Chinese word segmentation is the ambiguity.
    Page 1, “Introduction”
  3. In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
    Page 1, “Introduction”
  4. • We propose a joint model for Chinese word segmentation and new word detection.
    Page 2, “Introduction”
  5. Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010).
    Page 2, “Related Work”
  6. This phenomenon will also undermine the performance of Chinese word segmentation.
    Page 3, “System Architecture”
  7. The B, I, E labels have been widely used in previous work of Chinese word segmentation (Sun et al., 2009b).
    Page 4, “System Architecture”
  8. We used benchmark datasets provided by the Second International Chinese Word Segmentation Bakeoff to test our proposals.
    Page 6, “System Architecture”
  9. Best05 represents the best system of the Second International Chinese Word Segmentation Bakeoff on the corresponding data; CRF + rule-system represents confidence-based combination of CRF and rule-based models, presented in Zhang et al.
    Page 8, “System Architecture”
  10. In this paper, we presented a joint model for Chinese word segmentation and new word detection.
    Page 8, “System Architecture”


Chinese word segmentation

Appears in 10 sentences as: Chinese Word Segmentation (2) Chinese word segmentation (8)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. We present a joint model for Chinese word segmentation and new word detection.
    Page 1, “Abstract”
  2. The major problem of Chinese word segmentation is the ambiguity.
    Page 1, “Introduction”
  3. In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
    Page 1, “Introduction”
  4. • We propose a joint model for Chinese word segmentation and new word detection.
    Page 2, “Introduction”
  5. Conventional approaches to Chinese word segmentation treat the problem as a sequential labeling task (Xue, 2003; Peng et al., 2004; Tseng et al., 2005; Asahara et al., 2005; Zhao et al., 2010).
    Page 2, “Related Work”
  6. This phenomenon will also undermine the performance of Chinese word segmentation.
    Page 3, “System Architecture”
  7. The B, I, E labels have been widely used in previous work of Chinese word segmentation (Sun et al., 2009b).
    Page 4, “System Architecture”
  8. We used benchmark datasets provided by the Second International Chinese Word Segmentation Bakeoff to test our proposals.
    Page 6, “System Architecture”
  9. Best05 represents the best system of the Second International Chinese Word Segmentation Bakeoff on the corresponding data; CRF + rule-system represents confidence-based combination of CRF and rule-based models, presented in Zhang et al.
    Page 8, “System Architecture”
  10. In this paper, we presented a joint model for Chinese word segmentation and new word detection.
    Page 8, “System Architecture”


CRFs

Appears in 9 sentences as: CRFs (11)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. While most of the state-of-the-art CWS systems used semi-Markov conditional random fields or latent variable conditional random fields, we simply use a single first-order conditional random field (CRF) for the joint modeling.
    Page 1, “Introduction”
  2. The semi-Markov CRFs and latent variable CRFs relax the Markov assumption of CRFs to express more complicated dependencies, and therefore achieve higher disambiguation power.
    Page 1, “Introduction”
  3. Alternatively, our plan is not to relax the Markov assumption of CRFs, but to exploit more complicated dependencies by using refined high-dimensional features.
    Page 1, “Introduction”
  4. 3.1 A Joint Model Based on CRFs
    Page 3, “System Architecture”
  5. First, we briefly review CRFs.
    Page 3, “System Architecture”
  6. CRFs were proposed as a method for structured classification that solves “the label bias problem” (Lafferty et al., 2001).
    Page 3, “System Architecture”
  7. The first idea is to exploit word features as node features of CRFs.
    Page 3, “System Architecture”
  8. CRFs also have edge features that are based on label transitions.
    Page 4, “System Architecture”
  9. We proposed a new training method, ADF training, for very fast training of CRFs, even given large-scale datasets with high dimensional features.
    Page 8, “System Architecture”
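
For reference, the first-order linear-chain CRF that these excerpts describe has the standard form below (standard notation, not transcribed from the paper):

```latex
% Conditional probability of a label sequence y = (y_1, ..., y_n)
% given the character sequence x, with weight vector w:
P(y \mid x; w) = \frac{\exp\bigl(w^\top F(x, y)\bigr)}
                      {\sum_{y'} \exp\bigl(w^\top F(x, y')\bigr)},
\qquad
F(x, y) = \sum_{i=1}^{n} f(y_{i-1}, y_i, x, i).
```

Node features depend on (y_i, x); traditional edge features depend only on the transition (y_{i-1}, y_i), which is exactly what the paper's enriched edge features extend by also conditioning on x.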


bigram

Appears in 8 sentences as: bigram (8) bigrams (4)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. To derive word features, our system first automatically collects a list of word unigrams and bigrams from the training data.
    Page 3, “System Architecture”
  2. To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
    Page 3, “System Architecture”
  3. These lists of word unigrams and bigrams are then used as a unigram dictionary and a bigram dictionary to generate word-based unigram and bigram features.
    Page 3, “System Architecture”
  4. The word-based features are indicator functions that fire when the local character sequence matches a word unigram or bigram that occurred in the training data.
    Page 3, “System Architecture”
  5. • bigram1(x, y_i) ← [x_{j,i-1}, x_{i,k}, y_i], if the word bigram candidate [x_{j,i-1}, x_{i,k}] hits a word bigram [w_i, w_j] ∈ B and satisfies the aforementioned constraints on j and k. B represents the word bigram dictionary collected from the training data.
    Page 4, “System Architecture”
  6. • bigram2(x, y_i) ← [x_{j,i}, x_{i+1,k}, y_i], if the word bigram candidate [x_{j,i}, x_{i+1,k}] hits a word bigram [w_i, w_j] ∈ B and satisfies the aforementioned constraints on j and k.
    Page 4, “System Architecture”
  7. • Character bigrams located at positions i-2, i-1, i, and i+1
    Page 4, “System Architecture”
  8. To generate word-based features, we extracted high-frequency word-based unigram and bigram lists from the training data.
    Page 6, “System Architecture”
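
A minimal sketch of the dictionary-collection step these excerpts describe (names such as collect_word_dicts are illustrative; only the frequency-greater-than-2 filter and the unigram/bigram split come from the excerpts):

```python
from collections import Counter

def collect_word_dicts(segmented_sentences, min_freq=3):
    """Collect word unigram and bigram dictionaries from segmented
    training data, keeping entries whose frequency is larger than 2
    (i.e., at least 3), as the excerpts describe, to avoid overfitting."""
    uni, bi = Counter(), Counter()
    for words in segmented_sentences:      # each sentence is a list of gold words
        uni.update(words)
        bi.update(zip(words, words[1:]))   # adjacent word pairs
    U = {w for w, c in uni.items() if c >= min_freq}
    B = {p for p, c in bi.items() if c >= min_freq}
    return U, B

# Word-based features are then indicator functions: a feature fires when a
# local character span matches an entry of U, or two adjacent candidate
# spans match an entry of B, subject to the paper's constraints on j and k.
```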


joint modeling

Appears in 8 sentences as: Joint Model (1) joint model (3) joint modeling (4)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. We present a joint model for Chinese word segmentation and new word detection.
    Page 1, “Abstract”
  2. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling.
    Page 1, “Abstract”
  3. In this paper, we present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling of Chinese word segmentation (CWS) and new word detection (NWD).
    Page 1, “Introduction”
  4. While most of the state-of-the-art CWS systems used semi-Markov conditional random fields or latent variable conditional random fields, we simply use a single first-order conditional random field (CRF) for the joint modeling.
    Page 1, “Introduction”
  5. • We propose a joint model for Chinese word segmentation and new word detection.
    Page 2, “Introduction”
  6. 3.1 A Joint Model Based on CRFs
    Page 3, “System Architecture”
  7. In this paper, we presented a joint model for Chinese word segmentation and new word detection.
    Page 8, “System Architecture”
  8. We presented new features, including word-based features and enriched edge features, for the joint modeling.
    Page 8, “System Architecture”


CRF

Appears in 6 sentences as: CRF (7)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. Based on our CRF word segmentation system, we can compute a probability for each segment.
    Page 3, “System Architecture”
  2. Note that, although our model is a Markov CRF model, we can still use word features to learn word information in the training data.
    Page 3, “System Architecture”
  3. For traditional implementations of CRF systems (e.g., the HCRF package), the edge features usually contain only the information of y_{i-1} and y_i, without the information of x.
    Page 4, “System Architecture”
  4. The major reason for this simple realization of edge features in traditional CRF implementations is to reduce the feature dimension.
    Page 4, “System Architecture”
  5. As we will show in experiments, the training of the CRF model with high-dimensional new features is quite expensive, and the existing training method is not good enough.
    Page 4, “System Architecture”
  6. Best05 represents the best system of the Second International Chinese Word Segmentation Bakeoff on the corresponding data; CRF + rule-system represents confidence-based combination of CRF and rule-based models, presented in Zhang et al.
    Page 8, “System Architecture”
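
The first excerpt notes that the CRF system assigns a probability to each segment. Under a B/I/E scheme this can be read as the marginal probability that a character span carries a word-shaped label pattern, computable with the forward-backward algorithm; this is standard CRF machinery, and the paper's exact formulation for new word detection may differ:

```latex
% Marginal probability that characters x_a .. x_b form a single word:
P\bigl([a, b] \text{ is a word} \mid x\bigr)
  = P\bigl(y_a = B,\; y_{a+1} = \cdots = y_{b-1} = I,\; y_b = E \mid x\bigr)
```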


feature templates

Appears in 6 sentences as: feature templates (6)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. The word-based feature templates derived for the label y_i are as follows:
    Page 3, “System Architecture”
  2. For each label y_i, we use the feature templates as follows:
    Page 4, “System Architecture”
  3. The latter two feature templates are designed to detect character or word reduplication, a morphological phenomenon that can influence word segmentation in Chinese.
    Page 4, “System Architecture”
  4. Therefore, incorporating local observation information into the edge features will result in an explosion of edge features, which is 1,600 times larger than the number of feature templates.
    Page 4, “System Architecture”
  5. There are only nine possible label transitions: T = Y × Y and |T| = 9. As a result, the feature dimension will increase ninefold over the feature templates if we incorporate local observation information of x into the edge features.
    Page 4, “System Architecture”
  6. We employed the feature templates defined in Section 3.2.
    Page 6, “System Architecture”
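
To make the ninefold increase concrete, the toy sketch below conjoins one observation-based template with each of the nine transitions in T = Y × Y for Y = {B, I, E}; the string encoding of features is an illustrative assumption:

```python
LABELS = ("B", "I", "E")
TRANSITIONS = [(a, b) for a in LABELS for b in LABELS]
assert len(TRANSITIONS) == 9  # |T| = |Y x Y| = 9

def enriched_edge_features(node_feature):
    """Conjoin an observation-based node feature with every label
    transition (y_{i-1}, y_i): each template spawns nine edge features,
    the ninefold increase discussed in the excerpts above."""
    return [f"{node_feature}|{a}->{b}" for a, b in TRANSITIONS]

# e.g. one character-unigram template yields nine enriched edge features:
print(enriched_edge_features("c[0]=X"))
```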


latent variable

Appears in 6 sentences as: latent variable (5) latent variables (1)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. While most of the state-of-the-art CWS systems used semi-Markov conditional random fields or latent variable conditional random fields, we simply use a single first-order conditional random field (CRF) for the joint modeling.
    Page 1, “Introduction”
  2. The semi-Markov CRFs and latent variable CRFs relax the Markov assumption of CRFs to express more complicated dependencies, and therefore achieve higher disambiguation power.
    Page 1, “Introduction”
  3. To achieve high accuracy, most of the state-of-the-art systems are heavy probabilistic systems using semi-Markov assumptions or latent variables (Andrew, 2006; Sun et al., 2009b).
    Page 2, “Related Work”
  4. For example, one of the state-of-the-art CWS systems is the latent variable conditional random field system (Sun et al., 2008; Sun and Tsujii, 2009) presented in Sun et al.
    Page 2, “Related Work”
  5. Those semi-Markov perceptron systems are moderately faster than the heavy probabilistic systems using semi-Markov conditional random fields or latent variable conditional random fields.
    Page 2, “Related Work”
  6. Other online training methods include averaged SGD with feedback (Sun et al., 2010; Sun et al., 2011), latent variable perceptron training (Sun et al., 2009a), and so on.
    Page 3, “Related Work”


unigrams

Appears in 6 sentences as: unigram (3) unigrams (4)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. To derive word features, our system first automatically collects a list of word unigrams and bigrams from the training data.
    Page 3, “System Architecture”
  2. To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
    Page 3, “System Architecture”
  3. These lists of word unigrams and bigrams are then used as a unigram dictionary and a bigram dictionary to generate word-based unigram and bigram features.
    Page 3, “System Architecture”
  4. The word-based features are indicator functions that fire when the local character sequence matches a word unigram or bigram that occurred in the training data.
    Page 3, “System Architecture”
  5. • Character unigrams located at positions i-2, i-1, i, i+1, and i+2
    Page 4, “System Architecture”
  6. To generate word-based features, we extracted high-frequency word-based unigram and bigram lists from the training data.
    Page 6, “System Architecture”


perceptron

Appears in 4 sentences as: perceptron (5)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. A few other state-of-the-art CWS systems use semi-Markov perceptron methods or voting systems based on multiple semi-Markov perceptron segmenters (Zhang and Clark, 2007; Sun, 2010).
    Page 2, “Related Work”
  2. Those semi-Markov perceptron systems are moderately faster than the heavy probabilistic systems using semi-Markov conditional random fields or latent variable conditional random fields.
    Page 2, “Related Work”
  3. However, a disadvantage of perceptron-style systems is that they cannot provide probabilistic information.
    Page 2, “Related Work”
  4. Other online training methods include averaged SGD with feedback (Sun et al., 2010; Sun et al., 2011), latent variable perceptron training (Sun et al., 2009a), and so on.
    Page 3, “Related Work”


learning algorithm

Appears in 4 sentences as: learning algorithm (3) learning algorithms (1)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. ADF learning algorithm: procedure ADF(q, c, α, β)
    Page 5, “System Architecture”
  2. Figure 1: The proposed ADF online learning algorithm.
    Page 5, “System Architecture”
  3. Prior work on convergence analysis of existing online learning algorithms (Murata, 1998; Hsu et al.)
    Page 5, “System Architecture”
  4. We can show that the proposed ADF learning algorithm has reasonable convergence properties.
    Page 6, “System Architecture”
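
From the signature ADF(q, c, α, β) and the description of frequency-adaptive learning rates, the core idea can be sketched as follows. This is a plausible reading rather than the paper's exact pseudocode: q is taken as the window size, c as the initial learning rate, and α, β as decay bounds for rarely and frequently updated features respectively; all four mappings are assumptions.

```python
import numpy as np

def adf_sgd(grad_fn, dim, n_steps, q=100, c=0.1, alpha=0.995, beta=0.6):
    """Stochastic gradient ascent with frequency-adaptive per-feature
    learning rates (a sketch of the ADF idea, not the paper's exact code).

    Every q samples, each feature's rate is multiplied by a factor
    interpolated between alpha (rarely updated features decay slowly)
    and beta (frequently updated features decay fast).
    """
    w = np.zeros(dim)
    rates = np.full(dim, c)      # one learning rate per feature
    counts = np.zeros(dim)       # per-feature update counts in the window
    for t in range(1, n_steps + 1):
        g = grad_fn(w)           # stochastic gradient of one sampled instance
        counts += (g != 0)       # a feature "fires" if its gradient is nonzero
        w += rates * g           # per-feature step (ascent on the objective)
        if t % q == 0:
            u = counts / q                       # normalized frequency in [0, 1]
            rates *= alpha - (alpha - beta) * u  # frequent -> factor near beta
            counts[:] = 0
    return w
```

With α near 1 and β < α, rare features keep large steps while frequent features anneal quickly, which is the intuition behind the reported convergence speed-up.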


objective function

Appears in 4 sentences as: objective function (3) objective function, (1)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. SGD uses a small, randomly selected subset of the training samples to approximate the gradient of an objective function.
    Page 2, “Related Work”
  2. Given a training set of n labeled samples, parameter estimation is performed by maximizing the objective function.
    Page 3, “System Architecture”
  3. The final objective function is as follows:
    Page 3, “System Architecture”
  4. $\mathbb{E}(w_t) = w^* + \prod_{m=1}^{t} \bigl(I - \gamma_0 \beta^m H(w^*)\bigr)(w_0 - w^*)$, where $w^*$ is the optimal weight vector, and $H$ is the Hessian matrix of the objective function.
    Page 6, “System Architecture”
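
Combining these excerpts with the overfitting topic below, the training objective is the standard L2-regularized conditional log-likelihood; σ is the standard deviation of the Gaussian prior, and the exact notation is an assumption about the paper's formula:

```latex
L(w) = \sum_{i=1}^{n} \log P(y_i \mid x_i; w) - \frac{\lVert w \rVert_2^2}{2\sigma^2}
```

The second term is the regularizer the excerpts refer to; maximizing L(w) gives the parameter estimate.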


overfitting

Appears in 3 sentences as: overfitting (3)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. The second term is a regularizer for reducing overfitting.
    Page 3, “System Architecture”
  2. To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
    Page 3, “System Architecture”
  3. To reduce overfitting , we employed an L2 Gaussian weight prior (Chen and Rosenfeld, 1999) for all training methods.
    Page 7, “System Architecture”


significantly improve

Appears in 3 sentences as: significantly improve (2) significantly improves (1)
In Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
  1. We will show that this can significantly improve the convergence speed of online learning.
    Page 2, “Introduction”
  2. We found adding new edge features significantly improves the disambiguation power of our model.
    Page 4, “System Architecture”
  3. In this method, each feature has its own learning rate, and we will show that this can significantly improve the convergence speed of online learning.
    Page 5, “System Architecture”
