Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
Zeng, Xiaodong and Chao, Lidia S. and Wong, Derek F. and Trancoso, Isabel and Tian, Liang

Article Structure

Abstract

This study investigates building a better Chinese word segmentation model for statistical machine translation.

Introduction

Word segmentation is regarded as a critical procedure for high-level Chinese language processing tasks, since Chinese scripts are written in continuous characters without explicit word boundaries (e.g., space in English).

Related Work

In the literature, many approaches have been proposed to learn CWS models for SMT.

Methodology

This work aims at building a CWS model adapted to the SMT task.

Experiments

4.1 Data and Setup

Conclusion

This paper proposed a novel CWS model for the SMT task.

Topics

CRFs

Appears in 30 sentences as: CRFs (32)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. We propose dealing with the induced word boundaries as soft constraints to bias the continuous learning, on the bilingual data (unlabeled), of a supervised CRFs model trained on the treebank data (labeled).
    Page 1, “Abstract”
  2. Crucially, the GP expression with the bilingual knowledge is then used as side information to regularize a CRFs (conditional random fields) model’s learning over treebank and bitext data, based on the posterior regularization (PR) framework (Ganchev et al., 2010).
    Page 2, “Introduction”
  3. This constrained learning amounts to a joint coupling of GP and CRFs, i.e., integrating GP into the estimation of a parametric structural model.
    Page 2, “Introduction”
  4. (2008) enhanced a CRFs segmentation model in MT tasks by tuning the word granularity and improving the segmentation consistency.
    Page 2, “Related Work”
  5. Rather than making “hard” use of the bilingual segmentation knowledge, i.e., directly merging “char-to-word” alignments into words as supervision, this study extracts word boundary information of characters from the alignments as soft constraints to regularize a CRFs model’s learning.
    Page 2, “Related Work”
  6. (2014) proposed GP for inferring the label information of unlabeled data, and then leveraged these GP outcomes to learn a semi-supervised scalable model (e.g., CRFs).
    Page 2, “Related Work”
  7. But, unlike the prior pipelined approaches, this study performs joint learning, in which GP is used as a learning constraint that interacts with the CRFs model estimation.
    Page 2, “Related Work”
  8. One of our main objectives is to bias the CRFs model’s learning on unlabeled data, under a nonlinear GP constraint encoding the bilingual knowledge.
    Page 2, “Related Work”
  9. language D_CH and D_F; Ensure: θ, the CRFs model parameters; D_CH^wa ← char_align_bitext(D_CH, D_F); r ← learn_word_bound(D_CH^wa); G ← encode_graph_constraint(D_CH^wa, r); θ ← pr_crf_graph(D_CH^wa, G)
    Page 3, “Methodology”
  10. The GP expression will be defined in Section 3.3 as a PR constraint that reflects the interactions between the graph and the CRFs model.
    Page 4, “Methodology”
  11. Supervised linear-chain CRFs can be modeled in a standard conditional log-likelihood objective with a Gaussian prior:
    Page 5, “Methodology”
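
Excerpt 11 names the supervised objective without spelling it out. As a reference point, the standard linear-chain CRFs conditional log-likelihood with a Gaussian prior has the following textbook form; the paper's own feature templates are not reproduced in these excerpts:

    \mathcal{L}(\theta) = \sum_{i=1}^{l} \log p_\theta(\mathbf{y}_i \mid \mathbf{x}_i) - \frac{\lVert \theta \rVert^2}{2\sigma^2},
    \qquad
    p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_\theta(\mathbf{x})} \exp\Big( \sum_{t=1}^{m} \theta^\top f(y_{t-1}, y_t, \mathbf{x}, t) \Big)

Here f collects the transition and emission features, Z_θ(x) is the partition function, and σ² controls the strength of the prior.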


segmentations

Appears in 18 sentences as: segmentations (16) Segmenters (2)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. The prior works showed that these models help to find some segmentations tailored for SMT, since the bilingual word occurrence feature can be captured by the character-based alignment (Och and Ney, 2003).
    Page 1, “Introduction”
  2. Instead of directly merging the characters into concrete segmentations, this work attempts to extract word boundary distributions for character-level trigrams (types) from the “chars-to-word” mappings.
    Page 2, “Introduction”
  3. It is worth mentioning that prior works presented a straightforward usage for candidate words, treating them as gold segmentations, either as dictionary units or labeled resources.
    Page 3, “Methodology”
  4. • Self-training Segmenters (STS): two variant models were defined by the approach reported in Subramanya et al. (2010), which uses the supervised CRFs model’s decodings, incorporating empirical and constraint information, for unlabeled examples as additional labeled data to retrain a CRFs model.
    Page 7, “Experiments”
  5. • Virtual Evidences Segmenters (VES): two variant models based on the approach in Zeng et al. (2013) were defined.
    Page 7, “Experiments”
  6. This behaviour illustrates that the conventional optimizations to the monolingual supervised model, e.g., accumulating more supervised data or adding predefined segmentation properties, are insufficient to help the model achieve better segmentations for SMT.
    Page 7, “Experiments”
  7. This section aims to further analyze the three primary observations concluded in Section 4.3: i) word segmentation is useful to SMT; ii) the treebank and the bilingual segmentation knowledge are helpful, performing segmentation of different natures; and iii) the bilingual constraints lead to segmentations better tailored for SMT.
    Page 8, “Experiments”
  8. First, the SMT phrase extraction, i.e., building “phrases” on top of the character sequences, cannot fully capture all meaningful segmentations produced by the CS model.
    Page 8, “Experiments”
  9. Through analyzing both models’ segmentations for trainMT and testMT,
    Page 8, “Experiments”
  10. About 35% of the segmentations produced by the two models are identical.
    Page 8, “Experiments”
  11. If these identical segmentations are removed, and the experiments are rerun, the translation scores decrease (on average) by 0.50, 0.85 and 0.70 on BLEU, NIST and METEOR, respectively.
    Page 8, “Experiments”
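
Excerpts 10 and 11 rest on a simple overlap computation between two segmenters' outputs. A minimal sketch follows; the file names and the one-segmented-sentence-per-line format are assumptions, not the paper's actual data layout.

    # Fraction of sentences segmented identically by two models,
    # as in the ~35% overlap reported in excerpt 10.
    def read_segmented(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f]

    def identical_fraction(seg_a, seg_b):
        assert len(seg_a) == len(seg_b)
        same = sum(a == b for a, b in zip(seg_a, seg_b))
        return same / len(seg_a)

    cs   = read_segmented("trainMT.cs.seg")    # hypothetical paths
    ours = read_segmented("trainMT.ours.seg")
    print(f"identical segmentations: {identical_fraction(cs, ours):.1%}")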


treebank

Appears in 16 sentences as: Treebank (1) treebank (16)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. We propose dealing with the induced word boundaries as soft constraints to bias the continuous learning, on the bilingual data (unlabeled), of a supervised CRFs model trained on the treebank data (labeled).
    Page 1, “Abstract”
  2. The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on the hand-annotated treebank data, e.g., Chinese treebank
    Page 1, “Introduction”
  3. But one outstanding problem is that these models may leave out some crucial segmentation features for SMT, since the output words conform to the treebank segmentation standard designed for monolingual linguistic intuition, rather than specific to the SMT task.
    Page 1, “Introduction”
  4. Crucially, the GP expression with the bilingual knowledge is then used as side information to regularize a CRFs (conditional random fields) model’s learning over treebank and bitext data, based on the posterior regularization (PR) framework (Ganchev et al., 2010).
    Page 2, “Introduction”
  5. The input data requires two types of training resources: segmented Chinese sentences from the treebank, D_TB, and parallel unsegmented sentences of Chinese and foreign language, D_CH and D_F.
    Page 3, “Methodology”
  6. Algorithm 1 CWS model induction with bilingual constraints. Require: segmented Chinese sentences from treebank D_TB; parallel sentences of Chinese and foreign
    Page 3, “Methodology”
  7. As in conventional GP examples (Das and Smith, 2012), a similarity graph G = (V, E) is constructed over N types extracted from the Chinese training data, including the treebank D_TB and the bitexts D_CH.
    Page 4, “Methodology”
  8. Our learning problem belongs to semi-supervised learning (SSL), as the training is done on treebank labeled data (X_L, Y_L) = {(x_1, y_1), ..., (x_l, y_l)} and bilingual unlabeled data X_U = {x_1, ..., x_u}, where x_i = {x_1, ..., x_m} is an input word sequence and y_i = {y_1, ..., y_m}, y ∈ T, is its corresponding label sequence.
    Page 5, “Methodology”
  9. We start from initial parameters θ_0, estimated by supervised CRFs model training on treebank data.
    Page 5, “Methodology”
  10. The monolingual segmented data, trainTB, is extracted from the Penn Chinese Treebank (CTB-7) (Xue et al., 2005), containing 51,447 sentences.
    Page 6, “Experiments”
  11. • Supervised Monolingual Segmenter (SMS): this model is trained by CRFs on treebank training data (trainTB).
    Page 6, “Experiments”
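
Excerpt 8 treats the treebank as labeled sequences (x_i, y_i). To make that concrete, the sketch below converts a segmented treebank sentence into per-character labels for CRFs training; the BMES tag set is a common choice for CWS, though the paper's exact label set T is not given in these excerpts.

    # Segmented words -> per-character BMES labels (assumed tag set).
    def to_char_labels(words):
        chars, labels = [], []
        for w in words:
            tags = ["S"] if len(w) == 1 else ["B"] + ["M"] * (len(w) - 2) + ["E"]
            chars.extend(w)
            labels.extend(tags)
        return chars, labels

    chars, labels = to_char_labels(["我们", "喜欢", "读", "书"])
    # chars  = ['我', '们', '喜', '欢', '读', '书']
    # labels = ['B', 'E', 'B', 'E', 'S', 'S']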


segmentation model

Appears in 14 sentences as: segmentation model (9) Segmentation Models (1) segmentation models (3) segmentation models: (1)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. This study investigates building a better Chinese word segmentation model for statistical machine translation.
    Page 1, “Abstract”
  2. It aims at leveraging word boundary information, automatically learned by bilingual character-based alignments, to induce a preferable segmentation model.
    Page 1, “Abstract”
  3. The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on the hand-annotated treebank data, e.g., Chinese treebank
    Page 1, “Introduction”
  4. In recent years, a number of works (Xu et al., 2005; Chang et al., 2008; Ma and Way, 2009; Xi et al., 2012) attempted to build segmentation models for SMT based on bilingual unsegmented data, instead of monolingual segmented data.
    Page 1, “Introduction”
  5. We propose leveraging the bilingual knowledge to form learning constraints that guide a supervised segmentation model toward a better solution for SMT.
    Page 2, “Introduction”
  6. Section 3 presents the details of the proposed segmentation model.
    Page 2, “Introduction”
  7. (2008) enhanced a CRFs segmentation model in MT tasks by tuning the word granularity and improving the segmentation consistency.
    Page 2, “Related Work”
  8. (2008) produced a better segmentation model for SMT by concatenating various corpora regardless of their different specifications.
    Page 2, “Related Work”
  9. (2011) used the words learned from “chars-to-word” alignments to train a maximum entropy segmentation model.
    Page 2, “Related Work”
  10. An intuitive approach is to directly leverage the induced boundary distributions as label constraints to regularize segmentation model learning, based on a constrained learning algorithm.
    Page 4, “Methodology”
  11. 4.2 Various Segmentation Models
    Page 6, “Experiments”
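
Excerpt 10 mentions induced boundary distributions over types. The hypothetical sketch below illustrates that induction step: words induced from “chars-to-word” alignments are reduced to per-character boundary tags, and tag counts are normalized into soft label distributions r for character trigram types (the tag set and padding symbols are assumptions).

    from collections import Counter, defaultdict

    # Alignment-induced word sequences -> boundary tag distributions
    # r(tag | character trigram type), usable as soft constraints.
    def boundary_distributions(aligned_word_seqs):
        counts = defaultdict(Counter)
        for words in aligned_word_seqs:
            chars, labels = [], []
            for w in words:
                tags = ["S"] if len(w) == 1 else ["B"] + ["M"] * (len(w) - 2) + ["E"]
                chars.extend(w)
                labels.extend(tags)
            padded = ["<s>"] + chars + ["</s>"]
            for i, tag in enumerate(labels):
                counts[tuple(padded[i:i + 3])][tag] += 1
        return {t: {tag: n / sum(c.values()) for tag, n in c.items()}
                for t, c in counts.items()}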


word segmentation

Appears in 9 sentences as: Word Segmentation (1) Word segmentation (1) word segmentation (7)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. This study investigates building a better Chinese word segmentation model for statistical machine translation.
    Page 1, “Abstract”
  2. Word segmentation is regarded as a critical procedure for high-level Chinese language processing tasks, since Chinese scripts are written in continuous characters without explicit word boundaries (e.g., space in English).
    Page 1, “Introduction”
  3. The empirical works show that word segmentation can be beneficial to Chinese-to-English statistical machine translation (SMT) (Xu et al., 2005; Chang et al., 2008; Zhao et al., 2013).
    Page 1, “Introduction”
  4. The practice in state-of-the-art MT systems is that Chinese sentences are tokenized by a monolingual supervised word segmentation model trained on the hand-annotated treebank data, e.g., Chinese treebank
    Page 1, “Introduction”
  5. However, these models tend to miss other linguistic segmentation patterns captured by monolingual supervised models, and suffer from the negative effects of erroneous alignments on word segmentation.
    Page 1, “Introduction”
  6. This paper proposes an alternative Chinese Word Segmentation (CWS) model adapted to the SMT task, which seeks not only to maintain the advantages of a monolingual supervised model, having hand-annotated linguistic knowledge, but also to assimilate the relevant bilingual segmentation
    Page 1, “Introduction”
  7. The influence of word segmentation on the final translation is the main focus of our investigation.
    Page 6, “Experiments”
  8. Firstly, as expected, having word segmentation does help Chinese-to-English MT.
    Page 7, “Experiments”
  9. This section aims to further analyze the three primary observations concluded in Section 4.3: i) word segmentation is useful to SMT; ii) the treebank and the bilingual segmentation knowledge are helpful, performing segmentation of different natures; and iii) the bilingual constraints lead to segmentations better tailored for SMT.
    Page 8, “Experiments”


unlabeled data

Appears in 7 sentences as: unlabeled data (7)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. (2014) proposed GP for inferring the label information of unlabeled data, and then leveraged these GP outcomes to learn a semi-supervised scalable model (e.g., CRFs).
    Page 2, “Related Work”
  2. One of our main objectives is to bias the CRFs model’s learning on unlabeled data, under a nonlinear GP constraint encoding the bilingual knowledge.
    Page 2, “Related Work”
  3. (2008) described constraint-driven learning (CODL), which augments model learning on unlabeled data by adding a cost for violating expectations of constraint features designed with domain knowledge.
    Page 3, “Related Work”
  4. Our learning problem belongs to semi-supervised learning (SSL), as the training is done on treebank labeled data (X_L, Y_L) = {(x_1, y_1), ..., (x_l, y_l)} and bilingual unlabeled data X_U = {x_1, ..., x_u}, where x_i = {x_1, ..., x_m} is an input word sequence and y_i = {y_1, ..., y_m}, y ∈ T, is its corresponding label sequence.
    Page 5, “Methodology”
  5. In our setting, the CRFs model is required to learn from unlabeled data.
    Page 5, “Methodology”
  6. This work employs the posterior regularization (PR) framework (Ganchev et al., 2010) to bias the CRFs model’s learning on unlabeled data, under a constraint encoded by the graph propagation expression.
    Page 5, “Methodology”
  7. by the character-based alignment (VES-NO-GP), and the graph propagation (VES-GP-PL), are regarded as virtual evidences to bias the CRFs model’s learning on the unlabeled data.
    Page 7, “Experiments”
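
Excerpt 6 invokes the PR framework. Schematically, PR maximizes the usual likelihood while keeping the model's posteriors on the unlabeled data close to a constrained set Q; the generic form from Ganchev et al. (2010) is shown below, while the paper's GP-based constraint set and penalty weights are specific to the paper and only alluded to here:

    \max_\theta \; \mathcal{L}(\theta) \;-\; \min_{q \in \mathcal{Q}} \mathrm{KL}\big( q(\mathbf{Y}) \,\Vert\, p_\theta(\mathbf{Y} \mid \mathbf{X}_U) \big),
    \qquad
    \mathcal{Q} = \{\, q : \mathbb{E}_q[\phi(\mathbf{X}, \mathbf{Y})] \le \mathbf{b} \,\}

Here φ are constraint features (in this paper, derived from the graph propagation expression) and b their expectation bounds.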


NIST

Appears in 5 sentences as: NIST (5)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
    Page 6, “Experiments”
  2. The NIST evaluation campaign data, MT-03 and MT-05, are selected to comprise the MT development data, devMT, and testing data, testMT, respectively.
    Page 6, “Experiments”
  3. NIST and METEOR over others.
    Page 8, “Experiments”
  4. Models     BLEU    NIST    METEOR
     CS         29.38   59.85   54.07
     SMS        30.05   61.33   55.95
     UBS        30.15   61.56   55.39
     Stanford   30.40   61.94   56.01
    Page 8, “Experiments”
  5. If these identical segmentations are removed, and the experiments are rerun, the translation scores decrease (on average) by 0.50, 0.85 and 0.70 on BLEU, NIST and METEOR, respectively.
    Page 8, “Experiments”
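
The table in excerpt 4 supports a quick delta computation against the character baseline CS; the arithmetic below only re-uses the numbers quoted above.

    # Score deltas of each segmenter vs. the CS (character) baseline.
    scores = {
        "CS":       {"BLEU": 29.38, "NIST": 59.85, "METEOR": 54.07},
        "SMS":      {"BLEU": 30.05, "NIST": 61.33, "METEOR": 55.95},
        "UBS":      {"BLEU": 30.15, "NIST": 61.56, "METEOR": 55.39},
        "Stanford": {"BLEU": 30.40, "NIST": 61.94, "METEOR": 56.01},
    }
    base = scores["CS"]
    for model, s in scores.items():
        if model != "CS":
            print(model, {m: round(s[m] - base[m], 2) for m in s})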


BLEU

Appears in 4 sentences as: BLEU (3) BLEU, (1)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. We adopted three state-of-the-art metrics, BLEU (Papineni et al., 2002), NIST (Doddington et al., 2000) and METEOR (Banerjee and Lavie, 2005), to evaluate the translation quality.
    Page 6, “Experiments”
  2. Overall, the boldface numbers in the last row illustrate that our model obtains average improvements of 1.89, 1.76 and 1.61 on BLEU,
    Page 7, “Experiments”
  3. Models     BLEU    NIST    METEOR
     CS         29.38   59.85   54.07
     SMS        30.05   61.33   55.95
     UBS        30.15   61.56   55.39
     Stanford   30.40   61.94   56.01
    Page 8, “Experiments”
  4. If these identical segmentations are removed, and the experiments are rerun, the translation scores decrease (on average) by 0.50, 0.85 and 0.70 on BLEU, NIST and METEOR, respectively.
    Page 8, “Experiments”
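
Excerpt 1 lists the metrics, but these excerpts do not name the scoring toolkit. As one illustrative way to compute corpus BLEU over tokenized output (NIST and METEOR are scored by their own tools), NLTK can be used; this is a stand-in, not the paper's actual setup.

    from nltk.translate.bleu_score import corpus_bleu

    # One hypothesis with two tokenized references; real MT evaluation
    # scores a whole test set the same way.
    hypotheses = [["the", "cat", "sat", "on", "a", "mat"]]
    references = [[["the", "cat", "sat", "on", "the", "mat"],
                   ["a", "cat", "was", "on", "the", "mat"]]]
    print(corpus_bleu(references, hypotheses))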


Chinese word

Appears in 4 sentences as: Chinese Word (1) Chinese word (2) Chinese words (1)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. This study investigates building a better Chinese word segmentation model for statistical machine translation.
    Page 1, “Abstract”
  2. They leverage such mappings to either constitute a Chinese word dictionary for maximum-matching segmentation (Xu et al., 2004), or form labeled data for training a sequence labeling model (Paul et al., 2011).
    Page 1, “Introduction”
  3. This paper proposes an alternative Chinese Word Segmentation (CWS) model adapted to the SMT task, which seeks not only to maintain the advantages of a monolingual supervised model, having hand-annotated linguistic knowledge, but also to assimilate the relevant bilingual segmentation
    Page 1, “Introduction”
  4. All nine other CWS models outperform the CS baseline, which does not try to identify Chinese words at all.
    Page 7, “Experiments”
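
Excerpt 2 mentions maximum-matching segmentation with an alignment-derived dictionary. Below is a sketch of the greedy forward variant of that classic algorithm; the toy dictionary stands in for one built from “chars-to-word” mappings.

    # Greedy forward maximum matching: take the longest dictionary
    # word starting at each position, falling back to one character.
    def forward_max_match(sentence, dictionary, max_len=4):
        out, i = [], 0
        while i < len(sentence):
            for j in range(min(max_len, len(sentence) - i), 0, -1):
                cand = sentence[i:i + j]
                if j == 1 or cand in dictionary:
                    out.append(cand)
                    i += j
                    break
        return out

    print(forward_max_match("我们喜欢读书", {"我们", "喜欢", "读书"}))
    # ['我们', '喜欢', '读书']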


hyperparameter

Appears in 4 sentences as: hyperparameter (3) hyperparameters (1)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. The hyperparameter λ is used to control the impact of the penalty term.
    Page 5, “Methodology”
  2. There are four hyperparameters in our model to be tuned by using the development data (devMT) among the following settings: for the graph propagation, μ ∈ {0.2, 0.5, 0.8} and ρ ∈ {0.1, 0.3, 0.5, 0.8}; for the PR learning, λ ∈ {0 ≤ λ_i ≤ 1} and α ∈ {0 ≤ α_i ≤ 1}, where the step is 0.1.
    Page 6, “Experiments”
  3. The optimal hyperparameter values were found to be: STS-NO-GP (α = 0.8 and η = 0.6) and STS-GP-PL (μ = 0.5, ρ = 0.3, α = 0.8 and η = 0.6).
    Page 7, “Experiments”
  4. The optimal hyperparameter values were found to be: VES-NO-GP (α = 0.7) and VES-GP-PL (μ = 0.5, ρ = 0.3 and α = 0.7).
    Page 7, “Experiments”
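
Excerpt 2 describes tuning four hyperparameters on devMT. A hypothetical grid-search harness is sketched below; train_and_eval is a placeholder for the full segment-then-translate-then-score pipeline, which the excerpts do not detail.

    import itertools

    MU    = [0.2, 0.5, 0.8]              # graph propagation
    RHO   = [0.1, 0.3, 0.5, 0.8]         # graph propagation
    LAM   = [i / 10 for i in range(11)]  # PR learning, step 0.1
    ALPHA = [i / 10 for i in range(11)]  # PR learning, step 0.1

    def train_and_eval(mu, rho, lam, alpha):
        # placeholder: run GP + PR training, segment devMT, return BLEU
        return 0.0

    best = max(itertools.product(MU, RHO, LAM, ALPHA),
               key=lambda cfg: train_and_eval(*cfg))
    print("best (mu, rho, lambda, alpha):", best)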


labeled data

Appears in 3 sentences as: labeled data (3)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. They leverage such mappings to either constitute a Chinese word dictionary for maximum-matching segmentation (Xu et al., 2004), or form labeled data for training a sequence labeling model (Paul et al., 2011).
    Page 1, “Introduction”
  2. Our learning problem belongs to semi-supervised learning (SSL), as the training is done on treebank labeled data (X_L, Y_L) = {(x_1, y_1), ..., (x_l, y_l)} and bilingual unlabeled data X_U = {x_1, ..., x_u}, where x_i = {x_1, ..., x_m} is an input word sequence and y_i = {y_1, ..., y_m}, y ∈ T, is its corresponding label sequence.
    Page 5, “Methodology”
  3. • Self-training Segmenters (STS): two variant models were defined by the approach reported in Subramanya et al. (2010), which uses the supervised CRFs model’s decodings, incorporating empirical and constraint information, for unlabeled examples as additional labeled data to retrain a CRFs model.
    Page 7, “Experiments”
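
Excerpt 3 describes the STS baselines, i.e., a self-training loop. The schematic below shows only the loop structure; train_crf and decode are hypothetical stubs standing in for a real CRFs trainer and decoder.

    # Self-training in the spirit of Subramanya et al. (2010): decode
    # the unlabeled data with the current model, then retrain on the
    # union of gold and self-labeled examples.
    def train_crf(pairs):
        return {"data": list(pairs)}   # stub model

    def decode(model, x):
        return ["S"] * len(x)          # stub labeling

    def self_train(labeled, unlabeled, rounds=3):
        model = train_crf(labeled)
        for _ in range(rounds):
            pseudo = [(x, decode(model, x)) for x in unlabeled]
            model = train_crf(labeled + pseudo)
        return model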


machine translation

Appears in 3 sentences as: machine translation (3)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. This study investigates building a better Chinese word segmentation model for statistical machine translation.
    Page 1, “Abstract”
  2. The experiments on a Chinese-to-English machine translation task reveal that the proposed model can bring positive segmentation effects to translation quality.
    Page 1, “Abstract”
  3. The empirical works show that word segmentation can be beneficial to Chinese-to-English statistical machine translation (SMT) (Xu et al., 2005; Chang et al., 2008; Zhao et al., 2013).
    Page 1, “Introduction”


proposed model

Appears in 3 sentences as: proposed model (3)
In Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
  1. The experiments on a Chinese-to-English machine translation task reveal that the proposed model can bring positive segmentation effects to translation quality.
    Page 1, “Abstract”
  2. Section 4 reports the experimental results of the proposed model for a Chinese-to-English MT task.
    Page 2, “Introduction”
  3. The empirical results indicate that the proposed model can yield better segmentations for SMT.
    Page 9, “Conclusion”
