Max-Margin Tensor Neural Network for Chinese Word Segmentation
Pei, Wenzhe and Ge, Tao and Chang, Baobao

Article Structure

Abstract

Recently, neural network models for natural language processing tasks have received increasing attention for their ability to alleviate the burden of manual feature engineering.

Introduction

Unlike English and other Western languages, Chinese does not delimit words with whitespace.

Conventional Neural Network

2.1 Lookup Table

Max-Margin Tensor Neural Network

3.1 Tag Embedding

Experiment

4.1 Data and Model Selection

Related Work

Chinese word segmentation has been studied with considerable effort in the NLP community.

Conclusion

In this paper, we propose a new model called Max-Margin Tensor Neural Network that explicitly models the interactions between tags and context characters.

Topics

neural network

Appears in 39 sentences as: Neural Network (8) neural network (31) neural networks (1)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Recently, neural network models for natural language processing tasks have received increasing attention for their ability to alleviate the burden of manual feature engineering.
    Page 1, “Abstract”
  2. In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN).
    Page 1, “Abstract”
  3. Experiments on the benchmark dataset show that our model achieves better performance than previous neural network models and that our model can achieve a competitive performance with minimal feature engineering.
    Page 1, “Abstract”
  4. Recently, neural network models have received increasing attention for their ability to minimize the effort in feature engineering.
    Page 1, “Introduction”
  5. Although previous neural network models appear workable, a notable limitation is that the tag-tag, tag-character and character-character interactions are not well modeled.
    Page 1, “Introduction”
  6. In previous neural network models, however, such interactional effects can hardly be fully captured by relying only on the simple transition score and the single nonlinear transformation (see Section 2).
    Page 1, “Introduction”
  7. In order to address this problem, we propose a new model called Max-Margin Tensor Neural Network (MMTNN) that explicitly models the interactions
    Page 1, “Introduction”
  8. Experimental results show that our model outperforms other neural network models.
    Page 2, “Introduction”
  9. • We propose a Max-Margin Tensor Neural Network for Chinese word segmentation without feature engineering.
    Page 2, “Introduction”
  10. The test results on the benchmark dataset show that our model outperforms previous neural network models.
    Page 2, “Introduction”
  11. Section 2 describes the details of the conventional neural network architecture.
    Page 2, “Introduction”
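
The conventional architecture these excerpts refer to (a window of characters fed through a lookup table, a single nonlinear hidden layer and a linear output producing per-tag scores) can be sketched as below. This is a minimal illustration with assumed sizes (d, w, hidden size 300), not the paper's exact configuration.

import numpy as np

rng = np.random.default_rng(0)
d, w = 50, 5                      # embedding size and context window width (assumed)
vocab_size, n_tags = 4000, 4      # |D| and |T| (BMES)

M  = rng.normal(scale=0.01, size=(d, vocab_size))   # character embedding matrix
W1 = rng.normal(scale=0.01, size=(300, w * d))      # hidden layer weights (size 300 assumed)
b1 = np.zeros(300)
W2 = rng.normal(scale=0.01, size=(n_tags, 300))     # linear output: one score per tag
b2 = np.zeros(n_tags)

def tag_scores(window_char_ids):
    """Score the BMES tags for the centre character of a w-sized window."""
    a = M[:, window_char_ids].T.reshape(-1)   # Layer 1: concatenated embeddings, length w*d
    h = np.tanh(W1 @ a + b1)                  # the single nonlinear transformation
    return W2 @ h + b2                        # unnormalised per-tag scores

print(tag_scores([5, 17, 42, 8, 301]))        # scores for one 5-character window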

embeddings

Appears in 33 sentences as: embeddings (43)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. By exploiting tag embeddings and tensor-based transformation, MMTNN has the ability to model complicated interactions between tags and context characters.
    Page 1, “Abstract”
  2. between tags and context characters by exploiting tag embeddings and tensor-based transformation.
    Page 2, “Introduction”
  3. The character embeddings are then stacked into an embedding matrix M ∈ R^{d×|D|}.
    Page 2, “Conventional Neural Network”
  4. We will analyze the effect of character embeddings in more detail in Section 4.
    Page 3, “Conventional Neural Network”
  5. The character embeddings extracted by the Lookup Table layer are then concatenated into a single vector a ∈ R^{H_1}, where H_1 = w · d is the size of Layer 1.
    Page 3, “Conventional Neural Network”
  6. Similar to character embeddings, given a fixed-sized tag set T, the tag embeddings for tags are stored in a tag embedding matrix L ∈ R^{d×|T|}, where d is the dimensionality
    Page 3, “Max-Margin Tensor Neural Network”
  7. of the vector space (the same as for character embeddings).
    Page 4, “Max-Margin Tensor Neural Network”
  8. The tag embeddings start from a random initialization and can be automatically trained by back-propagation.
    Page 4, “Max-Margin Tensor Neural Network”
  9. Figure 2 shows the new Lookup Table layer with tag embeddings.
    Page 4, “Max-Margin Tensor Neural Network”
  10. Assuming we are at the i-th character of a sentence, besides the character embeddings, the tag embeddings of the previous tags are also considered.
    Page 4, “Max-Margin Tensor Neural Network”
  11. The concatenation operation in Layer 1 then concatenates the character embeddings and tag embedding together into a long vector a.
    Page 4, “Max-Margin Tensor Neural Network”
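
A minimal sketch of the Lookup Table layer with tag embeddings described above: character embeddings come from M ∈ R^{d×|D|}, tag embeddings from L ∈ R^{d×|T|}, and Layer 1 concatenates them into one long vector a. The sizes and the use of a single previous tag are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, w, vocab_size, n_tags = 50, 5, 4000, 4

M = rng.normal(scale=0.01, size=(d, vocab_size))   # character embedding matrix
L = rng.normal(scale=0.01, size=(d, n_tags))       # tag embedding matrix, trained by back-propagation

def layer1(window_char_ids, prev_tag_id):
    """Concatenate the window's character embeddings with the previous tag's embedding."""
    char_vecs = [M[:, c] for c in window_char_ids]   # w vectors of size d
    tag_vecs  = [L[:, prev_tag_id]]                  # embedding of t_{i-1} (one previous tag assumed)
    return np.concatenate(char_vecs + tag_vecs)      # a, of length (w + 1) * d

a = layer1([5, 17, 42, 8, 301], prev_tag_id=2)
print(a.shape)                                       # (300,)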

word segmentation

Appears in 19 sentences as: Word Segmentation (2) word segmentation (19)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN).
    Page 1, “Abstract”
  2. Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks.
    Page 1, “Abstract”
  3. Therefore, word segmentation is an important preliminary processing step for Chinese language processing.
    Page 1, “Introduction”
  4. (2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance.
    Page 1, “Introduction”
  5. We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005), which are commonly used for evaluation of Chinese word segmentation.
    Page 2, “Introduction”
  6. • We propose a Max-Margin Tensor Neural Network for Chinese word segmentation without feature engineering.
    Page 2, “Introduction”
  7. • Despite Chinese word segmentation being a specific case, our approach can be easily generalized to other sequence labeling tasks.
    Page 2, “Introduction”
  8. Formally, in the Chinese word segmentation task, we have a character dictionary D of size |D|. Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere.
    Page 2, “Conventional Neural Network”
  9. In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses 4 tags to carry word boundary information.
    Page 3, “Conventional Neural Network”
  10. (2013) modeled Chinese word segmentation as a series of
    Page 3, “Conventional Neural Network”
  11. Moreover, the simple nonlinear transformation in equation (2) is also too weak to model the complex interactional effects in Chinese word segmentation.
    Page 3, “Conventional Neural Network”
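
As a concrete illustration of the BMES tag set mentioned above, the sketch below converts between a word sequence and its character-level B/M/E/S tags; the example sentence is arbitrary and not from the paper.

def words_to_bmes(words):
    """B/M/E mark the beginning, middle and end of a multi-character word, S a single-character word."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover the segmentation from a character sequence and its BMES tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    return words

tags = words_to_bmes(["中国", "人民", "很", "友好"])
print(tags)                                    # ['B', 'E', 'B', 'E', 'S', 'B', 'E']
print(bmes_to_words("中国人民很友好", tags))    # ['中国', '人民', '很', '友好']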

Chinese word

Appears in 16 sentences as: Chinese Word (2) Chinese word (16)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN).
    Page 1, “Abstract”
  2. Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks.
    Page 1, “Abstract”
  3. (2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance.
    Page 1, “Introduction”
  4. We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005), which are commonly used for evaluation of Chinese word segmentation.
    Page 2, “Introduction”
  5. • We propose a Max-Margin Tensor Neural Network for Chinese word segmentation without feature engineering.
    Page 2, “Introduction”
  6. • Despite Chinese word segmentation being a specific case, our approach can be easily generalized to other sequence labeling tasks.
    Page 2, “Introduction”
  7. Formally, in the Chinese word segmentation task, we have a character dictionary D of size |D|. Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere.
    Page 2, “Conventional Neural Network”
  8. In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses 4 tags to carry word boundary information.
    Page 3, “Conventional Neural Network”
  9. (2013) modeled Chinese word segmentation as a series of
    Page 3, “Conventional Neural Network”
  10. Moreover, the simple nonlinear transformation in equation (2) is also too weak to model the complex interactional effects in Chinese word segmentation.
    Page 3, “Conventional Neural Network”
  11. In Chinese word segmentation, a proper modeling of the tag-tag interaction, tag-character interaction and character-character interaction is very important.
    Page 4, “Max-Margin Tensor Neural Network”

Chinese word segmentation

Appears in 16 sentences as: Chinese Word Segmentation (2) Chinese word segmentation (16)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. In this paper, we propose a novel neural network model for Chinese word segmentation called Max-Margin Tensor Neural Network (MMTNN).
    Page 1, “Abstract”
  2. Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks.
    Page 1, “Abstract”
  3. (2011) to Chinese word segmentation and POS tagging and proposed a perceptron-style algorithm to speed up the training process with negligible loss in performance.
    Page 1, “Introduction”
  4. We evaluate the performance of Chinese word segmentation on the PKU and MSRA benchmark datasets in the second International Chinese Word Segmentation Bakeoff (Emerson, 2005), which are commonly used for evaluation of Chinese word segmentation.
    Page 2, “Introduction”
  5. • We propose a Max-Margin Tensor Neural Network for Chinese word segmentation without feature engineering.
    Page 2, “Introduction”
  6. • Despite Chinese word segmentation being a specific case, our approach can be easily generalized to other sequence labeling tasks.
    Page 2, “Introduction”
  7. Formally, in the Chinese word segmentation task, we have a character dictionary D of size |D|. Unless otherwise specified, the character dictionary is extracted from the training set and unknown characters are mapped to a special symbol that is not used elsewhere.
    Page 2, “Conventional Neural Network”
  8. In Chinese word segmentation, the most prevalent tag set T is the BMES tag set, which uses 4 tags to carry word boundary information.
    Page 3, “Conventional Neural Network”
  9. (2013) modeled Chinese word segmentation as a series of
    Page 3, “Conventional Neural Network”
  10. Moreover, the simple nonlinear transformation in equation (2) is also too weak to model the complex interactional effects in Chinese word segmentation.
    Page 3, “Conventional Neural Network”
  11. In Chinese word segmentation, a proper modeling of the tag-tag interaction, tag-character interaction and character-character interaction is very important.
    Page 4, “Max-Margin Tensor Neural Network”

bigram

Appears in 11 sentences as: bigram (13)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Therefore, we integrate additional simple character bigram features into our model and the result shows that our model can achieve a competitive performance that other systems hardly achieve unless they use more complex task-specific features.
    Page 2, “Introduction”
  2. Model                          PKU    MSRA
     Best05 (Chen et al., 2005)     95.0   96.0
     Best05 (Tseng et al., 2005)    95.0   96.4
     (Zhang et al., 2006)           95.1   97.1
     (Zhang and Clark, 2007)        94.5   97.2
     (Sun et al., 2009)             95.2   97.3
     (Sun et al., 2012)             95.4   97.4
     (Zhang et al., 2013)           96.1   97.4
     MMTNN                          94.0   94.9
     MMTNN + bigram                 95.2   97.2
    Page 8, “Experiment”
  3. A very common feature in Chinese word segmentation is the character bigram feature.
    Page 8, “Experiment”
  4. Formally, at the i-th character of a sentence c_[1:n], the bigram features are c_k c_{k+1} (i − 3 < k < i + 2).
    Page 8, “Experiment”
  5. In our model, the bigram features are extracted in the window context and then the corresponding bigram embeddings are concatenated with character embeddings in Layer 1 and fed into Layer 2.
    Page 8, “Experiment”
  6. (2013), the bigram embeddings are pre-trained on unlabeled data with character embeddings, which significantly improves the model performance.
    Page 8, “Experiment”
  7. Given the long time for pre-training bigram embeddings, we only pre-train the character embeddings, and the bigram embeddings are initialized as the average of the character embeddings of c_k and c_{k+1}.
    Page 8, “Experiment”
  8. Further improvement could be obtained if the bigram embeddings are also pre-trained.
    Page 8, “Experiment”
  9. When bigram features are added, the F-score of our model improves
    Page 8, “Experiment”
  10. It is a competitive result given that our model only uses simple bigram features while other models use more complex features.
    Page 9, “Experiment”
  11. Most previous systems address this task by using linear statistical models with carefully designed features such as bigram features, punctuation information (Li and Sun, 2009) and statistical information (Sun and Xu, 2011).
    Page 9, “Related Work”
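
A sketch of the bigram features and their embedding initialisation described above: at position i the bigrams c_k c_{k+1} with i−3 < k < i+2 are collected, and each bigram embedding starts as the average of its two character embeddings. The boundary clamping and all sizes are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
d = 50
char_emb = {c: rng.normal(scale=0.01, size=d) for c in "我爱北京天安门"}   # toy character embeddings

def bigram_features(chars, i):
    """Return the bigrams c_k c_{k+1} for i-3 < k < i+2, clamped to the sentence."""
    return [chars[k] + chars[k + 1]
            for k in range(max(i - 2, 0), min(i + 2, len(chars) - 1))]

def init_bigram_embedding(bigram):
    """Initialise a bigram embedding as the average of its two character embeddings."""
    return (char_emb[bigram[0]] + char_emb[bigram[1]]) / 2.0

for bg in bigram_features("我爱北京天安门", i=3):
    print(bg, init_bigram_embedding(bg)[:3])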

overfitting

Appears in 10 sentences as: overfit (1) overfitting (9)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Furthermore, a new tensor factorization approach is proposed to speed up the model and avoid overfitting.
    Page 1, “Abstract”
  2. by the design of features, and the number of features could be so large that the resulting models are too large for practical use and prone to overfitting on the training corpus.
    Page 1, “Introduction”
  3. Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and prevents overfitting.
    Page 2, “Introduction”
  4. Not only does this approach improve the efficiency of our model, but it also avoids the risk of overfitting.
    Page 2, “Introduction”
  5. Moreover, the additional tensor could bring millions of parameters to the model, which puts the model at risk of overfitting.
    Page 5, “Max-Margin Tensor Neural Network”
  6. As long as r is small enough, the factorized tensor operation is much faster than the unfactorized one and the number of free parameters is much smaller, which prevents the model from overfitting.
    Page 5, “Max-Margin Tensor Neural Network”
  7. However, given the small size of their tensor matrix, they do not face the high time cost and overfitting problems that we face in modeling a sequence labeling task like Chinese word segmentation.
    Page 9, “Related Work”
  8. That’s why we propose to decrease computational cost and avoid overfitting with tensor factorization.
    Page 9, “Related Work”
  9. By introducing tensor factorization into the neural network model for sequence labeling tasks, the model training and inference are sped up and overfitting is prevented.
    Page 9, “Related Work”
  10. Moreover, we propose a tensor factorization approach that effectively improves the model efficiency and avoids the risk of overfitting.
    Page 9, “Conclusion”
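
The parameter/time trade-off behind the tensor factorization discussed above can be illustrated on a single tensor slice: replacing an n×n interaction matrix V by a rank-r product P·Q leaves the bilinear score a^T V a unchanged while shrinking the free parameters from n² to 2nr. The exact factorization used in the paper may differ; this only sketches the idea.

import numpy as np

rng = np.random.default_rng(0)
n, r = 300, 4                           # input size and small factorization rank (assumed)

a = rng.normal(size=n)
P = rng.normal(scale=0.01, size=(n, r))
Q = rng.normal(scale=0.01, size=(r, n))

V = P @ Q                               # the full n x n slice (never materialised in practice)

full_score     = a @ V @ a              # O(n^2) work, n^2 parameters
factored_score = (a @ P) @ (Q @ a)      # O(n*r) work, 2*n*r parameters

print(np.isclose(full_score, factored_score))   # True: same score
print("parameters:", n * n, "vs", 2 * n * r)    # 90000 vs 2400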

sequence labeling

Appears in 8 sentences as: sequence labeling (8)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Despite Chinese word segmentation being a specific case, MMTNN can be easily generalized and applied to other sequence labeling tasks.
    Page 1, “Abstract”
  2. Most previous systems address this problem by treating this task as a sequence labeling problem where each character is assigned a tag indicating its position in the word.
    Page 1, “Introduction”
  3. (2011) developed the SENNA system that approaches or surpasses the state-of-the-art systems on a variety of sequence labeling tasks for English.
    Page 1, “Introduction”
  4. • Despite Chinese word segmentation being a specific case, our approach can be easily generalized to other sequence labeling tasks.
    Page 2, “Introduction”
  5. Although tensor-based transformation is effective for capturing the interactions, introducing it into neural network models for sequence labeling tasks is time-prohibitive, since the tensor product operation drastically slows down the model.
    Page 5, “Max-Margin Tensor Neural Network”
  6. The most popular approach treats word segmentation as a sequence labeling problem which was first proposed in Xue (2003).
    Page 9, “Related Work”
  7. However, given the small size of their tensor matrix, they do not face the high time cost and overfitting problems that we face in modeling a sequence labeling task like Chinese word segmentation.
    Page 9, “Related Work”
  8. By introducing tensor factorization into the neural network model for sequence labeling tasks, the model training and inference are sped up and overfitting is prevented.
    Page 9, “Related Work”

CRF

Appears in 6 sentences as: CRF (6) CRF++ (1)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Model          P      R      F      OOV
     CRF            87.8   85.7   86.7   57.1
     NN             92.4   92.2   92.3   60.0
     NN+Tag Embed   93.0   92.7   92.9   61.0
     MMTNN          93.7   93.4   93.5   64.2
    Page 7, “Experiment”
  2. We also compare our model with the CRF model (Lafferty et al., 2001), which is a widely used log-linear model for Chinese word segmentation.
    Page 7, “Experiment”
  3. The input feature to the CRF model is simply the context characters (unigram feature) without any additional feature engineering.
    Page 7, “Experiment”
  4. We use an open source toolkit, CRF++, to train the CRF model.
    Page 7, “Experiment”
  5. Compared with CRF, there are two differences in neural network models.
    Page 7, “Experiment”
  6. In fact, CRF can be regarded as a special neural network without a nonlinear function (Wang and Manning, 2013).
    Page 7, “Experiment”
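
The remark that a CRF can be viewed as a neural network without the nonlinearity can be illustrated on toy window features: dropping the tanh collapses the two layers into a single linear map, which is the kind of linear (log-linear) scoring a CRF performs over its features. The real CRF baseline uses discrete indicator features, so this is only an analogy sketch with assumed sizes.

import numpy as np

rng = np.random.default_rng(0)
a  = rng.normal(size=250)                  # concatenated window features (w*d = 250 assumed)
W1 = rng.normal(scale=0.1, size=(100, 250))
W2 = rng.normal(scale=0.1, size=(4, 100))

nn_scores     = W2 @ np.tanh(W1 @ a)       # neural network tag scores (nonlinear in a)
linear_scores = (W2 @ W1) @ a              # without tanh the layers collapse to one linear map

print(nn_scores)
print(linear_scores)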

distributed representation

Appears in 5 sentences as: distributed representation (5)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. The idea of distributed representation for symbolic data is one of the most important reasons why the neural network works.
    Page 2, “Conventional Neural Network”
  2. To better model the tag-tag interaction given the context characters, distributed representation for tags instead of traditional discrete symbolic representation is used in our model.
    Page 3, “Max-Margin Tensor Neural Network”
  3. Wang and Manning (2013) conduct an empirical study on the effect of nonlinearity and the results suggest that nonlinear models are highly effective only when distributed representation is used.
    Page 7, “Experiment”
  4. To explain why distributed representation captures more information than discrete features, we show in Table 4 the effect of character embeddings which are obtained from the lookup table of MMTNN after training.
    Page 7, “Experiment”
  5. Therefore, compared with discrete feature representations, distributed representation can capture the syntactic and semantic similarity between characters.
    Page 7, “Experiment”
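
The kind of inspection behind Table 4 (nearest characters under the learned embeddings) can be sketched with cosine similarity. The embeddings below are random stand-ins, so only the procedure, not the neighbours, is meaningful.

import numpy as np

rng = np.random.default_rng(0)
chars = list("日月年水火山中国人民")
emb = {c: rng.normal(size=50) for c in chars}        # stand-ins for trained character embeddings

def nearest(query, k=3):
    """Return the k characters whose embeddings are most cosine-similar to the query's."""
    q = emb[query]
    sims = {c: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
            for c, v in emb.items() if c != query}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("日"))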

F-score

Appears in 4 sentences as: F-score (4)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. As we can see, by using tag embeddings, the F-score is improved by +0.6% and OOV recall is improved by +1.0%, which shows that tag embeddings succeed in modeling the tag-tag interaction and tag-character interaction.
    Page 7, “Experiment”
  2. The F-score is improved by +0.6% while OOV recall is improved by +3.2%, which indicates that tensor-based transformation captures more interactional information than simple nonlinear transformation.
    Page 7, “Experiment”
  3. As shown in Table 5 (last three rows), both the F-score and OOV recall of our model improve with pre-training.
    Page 8, “Experiment”
  4. When bigram features are added, the F-score of our model improves
    Page 8, “Experiment”
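
For reference, segmentation precision, recall and F-score are computed over word spans, with F = 2PR/(P+R). The sketch below shows this standard bakeoff-style computation; it is not code from the paper.

def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def prf(gold_words, pred_words):
    """Word-level precision, recall and F-score of a predicted segmentation."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(prf(["中国", "人民", "很", "友好"], ["中国", "人民", "很友好"]))   # ≈ (0.67, 0.50, 0.57)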

highest scoring

Appears in 4 sentences as: highest score (1) highest score: (1) highest scoring (2)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. For a given training instance (x_i, y_i), we search for the tag sequence with the highest score:
    Page 6, “Max-Margin Tensor Neural Network”
  2. The objective of max-margin training is that the highest scoring tag sequence is the correct one: y* = y_i, and its score will be larger, up to a margin, than the score of any other possible tag sequence ŷ ∈ Y(x_i):
    Page 6, “Max-Margin Tensor Neural Network”
  3. By minimizing this objective, the score of the correct tag sequence y_i is increased and the score of the highest scoring incorrect tag sequence ŷ is decreased.
    Page 6, “Max-Margin Tensor Neural Network”
  4. where y_max is the tag sequence with the highest score in equation (13).
    Page 6, “Max-Margin Tensor Neural Network”
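
A minimal sketch of the search and training criterion described above: Viterbi decoding finds the highest scoring tag sequence under per-position (emission) scores plus transition scores, and the max-margin (structured hinge) loss compares it with the gold sequence via loss-augmented decoding. The Hamming-distance margin used here is a common choice and is only assumed to match the paper's definition.

import numpy as np

def viterbi(emissions, A):
    """emissions: (n, T) per-character tag scores; A: (T, T) transition scores."""
    n, T = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, T), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + A + emissions[i][None, :]   # cand[prev_tag, cur_tag]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1], float(score.max())

def margin_loss(emissions, A, gold, delta=1.0):
    """Structured hinge loss with a Hamming margin (assumed form): max(0, max_y[s(y)+Δ(y,gold)] - s(gold))."""
    n = len(gold)
    aug = emissions.copy()
    for i, t in enumerate(gold):              # add delta to every wrong tag at position i
        aug[i] += delta
        aug[i, t] -= delta
    _, augmented_best = viterbi(aug, A)
    gold_score = emissions[0, gold[0]] + sum(
        A[gold[i - 1], gold[i]] + emissions[i, gold[i]] for i in range(1, n))
    return max(0.0, augmented_best - gold_score)

rng = np.random.default_rng(0)
E, A = rng.normal(size=(6, 4)), rng.normal(size=(4, 4))
print(viterbi(E, A))
print(margin_loss(E, A, gold=[0, 3, 0, 3, 1, 1]))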

sentence-level

Appears in 4 sentences as: sentence-level (4)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. To model the tag dependency, previous neural network models (Collobert et al., 2011; Zheng et al., 2013) introduce a transition score A_ij for jumping from tag i ∈ T to tag j ∈ T. For an input sentence c_[1:n] with a tag sequence t_[1:n], a sentence-level score is then given by the sum of transition and network scores:
    Page 3, “Conventional Neural Network”
  2. Given the sentence-level score, Zheng et al.
    Page 3, “Conventional Neural Network”
  3. (2013), their model is a global one where training and inference are performed at sentence-level.
    Page 3, “Conventional Neural Network”
  4. (2013), our model is also trained at sentence-level and carries out inference globally.
    Page 4, “Max-Margin Tensor Neural Network”
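
The sentence-level score referred to above is, in the Collobert-et-al.-style architecture the excerpt describes, the sum of transition and per-position network scores. A reconstruction (notation may differ slightly from the paper's) is:

s(c_{[1:n]}, t_{[1:n]}, \theta) = \sum_{i=1}^{n} \left( A_{t_{i-1} t_i} + f_\theta(t_i \mid c_{[1:n]}, i) \right)

where A_{t_{i-1} t_i} is the transition score for moving from tag t_{i-1} to tag t_i and f_\theta(t_i \mid c_{[1:n]}, i) is the network's score for assigning tag t_i to the i-th character.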

unlabeled data

Appears in 3 sentences as: unlabeled data (3)
In Max-Margin Tensor Neural Network for Chinese Word Segmentation
  1. Previous work found that the performance can be improved by pre-training the character embeddings on large unlabeled data and using the obtained embeddings to initialize the character lookup table instead of random initialization
    Page 7, “Experiment”
  2. There are several ways to learn the embeddings on unlabeled data.
    Page 8, “Experiment”
  3. (2013), the bigram embeddings are pre-trained on unlabeled data with character embeddings, which significantly improves the model performance.
    Page 8, “Experiment”
