Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro

Article Structure

Abstract

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment.

Introduction

Many languages, especially Asian languages such as Chinese, Japanese and Myanmar, have no explicit word boundaries; thus word segmentation (WS), that is, segmenting the continuous text of these languages into isolated words, is a prerequisite for many natural language processing applications, including SMT.

Methods

This section describes our unified monolingual and bilingual UWS scheme.

Complexity Analysis

The computational complexity of our method is linear in the number of iterations, the size of the corpus, and the complexity of calculating the expectations on each sentence or sentence pair.
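
Written out, the statement amounts to the product below; the symbols $I$, $N$ and $C_E$ are our own shorthand for the three factors named in the sentence above, not notation taken from the paper:

\[
  \text{total training cost} \;=\; \mathcal{O}\bigl(I \cdot N \cdot C_E\bigr),
\]

where $I$ is the number of training iterations, $N$ the number of sentences (or sentence pairs) in the corpus, and $C_E$ the cost of computing the expectations for a single sentence (or sentence pair).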

Topics

segmenters

Appears in 11 sentences as: segmentations (3) segmenters (8)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the NTCIR PatentMT corpus, which is out-of-domain.
    Page 1, “Abstract”
  2. Though supervised-learning approaches, which involve training segmenters on manually segmented corpora, are widely used (Chang et al., 2008), the criteria for manually annotating words are arbitrary, and the available annotated corpora are limited in both quantity and genre variety.
    Page 1, “Introduction”
  3. The set $\mathcal{F}$ is chosen to represent an unsegmented foreign language sentence (a sequence of characters), because an unsegmented sentence can be seen as the set of all possible segmentations of the sentence, denoted $F$, i.e. $F \in \mathcal{F}$.
    Page 2, “Methods”
  4. Character-based segmentation, the LDC segmenter, and the Stanford Chinese segmenters were used as the baseline methods.
    Page 4, “Complexity Analysis”
  5. The training was started by assuming that there were no previous segmentations of each sentence (pair), and the number of iterations was fixed.
    Page 5, “Complexity Analysis”
  6. The monolingual bigram model, however, was slower to converge, so we started it from the segmentations of the unigram model and used 10 iterations.
    Page 5, “Complexity Analysis”
  7. The experimental results show that the proposed UWS methods are comparable to the Stanford segmenters on the OpenMT06 corpus, while achieving a 0.96 BLEU increase on the PatentMT9 corpus.
    Page 5, “Complexity Analysis”
  8. This is because this corpus is out-of-domain for the supervised segmenters.
    Page 5, “Complexity Analysis”
  9. The proposed method does not require any annotated data, but the SMT system using it can achieve performance comparable to state-of-the-art supervised word segmenters trained on precious annotated data.
    Page 5, “Complexity Analysis”
  10. Moreover, the proposed method yields a 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus.
    Page 5, “Complexity Analysis”
  11. Thus, we believe that the proposed method would benefit SMT for low-resource languages where annotated data are scarce, and would also find application in domains that differ greatly from the domains on which supervised word segmenters were trained.
    Page 5, “Complexity Analysis”

BLEU

Appears in 9 sentences as: BLEU (9)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the NTCIR PatentMT corpus, which is out-of-domain.
    Page 1, “Abstract”
  2. Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
    Page 2, “Introduction”
  3. In this section, the proposed method is first validated on monolingual segmentation tasks, and then evaluated in the context of SMT to study whether the translation quality, measured by BLEU, can be improved.
    Page 4, “Complexity Analysis”
  4. For the bilingual tasks, the publicly available Moses system (Koehn et al., 2007) with default settings was employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate translation quality.
    Page 4, “Complexity Analysis”
  5. It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
    Page 5, “Complexity Analysis”
  6. Table 4 presents the BLEU scores for Moses using different segmentation methods.
    Page 5, “Complexity Analysis”
  7. The experimental results show that the proposed UWS methods are comparable to the Stanford segmenters on the OpenMT06 corpus, while achieving a 0.96 BLEU increase on the PatentMT9 corpus.
    Page 5, “Complexity Analysis”
  8. Method | BLEU (header row of Table 4)
    Page 5, “Complexity Analysis”
  9. Moreover, the proposed method yields a 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus.
    Page 5, “Complexity Analysis”

bigram

Appears in 7 sentences as: bigram (7)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. For the monolingual bigram model, the number of states in the HMM is $U$ times more than that of the monolingual unigram model, as the states at a specific position of $F$ are related not only to the length of the current word but also to the length of the word before it.
    Page 4, “Complexity Analysis”
  2. NPY (bigram): 0.750, 0.802, 17 m; NPY (trigram): 0.757, 0.807 (rows from the monolingual segmentation results table)
    Page 5, “Complexity Analysis”
  3. HDP (bigram): 0.723, 10 h; Fitness: 0.667 (rows from the same table)
    Page 5, “Complexity Analysis”
  4. Prop. (bigram): 0.774, 0.806, 15 s, 2530 s (row from the same table)
    Page 5, “Complexity Analysis”
  5. The monolingual bigram model, however, was slower to converge, so we started it from the segmentations of the unigram model and used 10 iterations.
    Page 5, “Complexity Analysis”
  6. In monolingual segmentation, the proposed methods with both unigram and bigram models were tested.
    Page 5, “Complexity Analysis”
  7. The proposed method with the monolingual bigram model performed poorly on the Chinese monolingual segmentation task; thus, it was not tested.
    Page 5, “Complexity Analysis”

unigram

Appears in 7 sentences as: unigram (8)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. This section uses a unigram model for convenience of description, but the method can be extended to n-gram models.
    Page 3, “Methods”
  2. For the monolingual bigram model, the number of states in the HMM is $U$ times more than that of the monolingual unigram model, as the states at a specific position of $F$ are related not only to the length of the current word but also to the length of the word before it.
    Page 4, “Complexity Analysis”
  3. Thus its complexity is $U^2$ times the unigram model's complexity (a worked form of this relation is sketched after this list).
    Page 4, “Complexity Analysis”
  4. Prop. (unigram): 0.729, 0.804, 3 s, 50 s (row from the monolingual segmentation results table)
    Page 5, “Complexity Analysis”
  5. It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
    Page 5, “Complexity Analysis”
  6. The monolingual bigram model, however, was slower to converge, so we started it from the segmentations of the unigram model and used 10 iterations.
    Page 5, “Complexity Analysis”
  7. In monolingual segmentation, the proposed methods with both unigram and bigram models were tested.
    Page 5, “Complexity Analysis”
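
Putting items 2 and 3 above together, and writing $K$ for the number of characters in the sentence and $U$ for the maximum word length, the per-sentence relation can be sketched as follows; the $\mathcal{O}(K \cdot U)$ figure for the unigram case is our assumption for a direct forward-backward implementation, not a number quoted from the paper:

\[
  C_{\text{bigram}} \;=\; U^{2}\cdot C_{\text{unigram}},
  \qquad
  C_{\text{unigram}} \;=\; \mathcal{O}(K \cdot U)
  \;\;\Longrightarrow\;\;
  C_{\text{bigram}} \;=\; \mathcal{O}(K \cdot U^{3}).
\]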

word segmenters

Appears in 6 sentences as: word segmentation (2) word segmenter (1) word segmenters (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment.
    Page 1, “Abstract”
  2. Many languages, especially Asian languages such as Chinese, Japanese and Myanmar, have no explicit word boundaries; thus word segmentation (WS), that is, segmenting the continuous text of these languages into isolated words, is a prerequisite for many natural language processing applications, including SMT.
    Page 1, “Introduction”
  3. Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
    Page 2, “Introduction”
  4. The proposed method does not require any annotated data, but the SMT system using it can achieve performance comparable to state-of-the-art supervised word segmenters trained on precious annotated data.
    Page 5, “Complexity Analysis”
  5. Moreover, the proposed method yields a 0.96 BLEU improvement relative to supervised word segmenters on an out-of-domain corpus.
    Page 5, “Complexity Analysis”
  6. Thus, we believe that the proposed method would benefit SMT for low-resource languages where annotated data are scarce, and would also find application in domains that differ greatly from the domains on which supervised word segmenters were trained.
    Page 5, “Complexity Analysis”

machine translation

Appears in 5 sentences as: machine translation (5)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment.
    Page 1, “Abstract”
  2. For example, in machine translation, there are various parallel corpora such as
    Page 1, “Introduction”
  3. The first bilingual corpus, OpenMT06, was used in the NIST Open Machine Translation 2006 Evaluation.
    Page 4, “Complexity Analysis”
  4. PatentMT9 is from the shared task of NTCIR-9 patent machine translation.
    Page 4, “Complexity Analysis”
  5. For the bilingual tasks, the publicly available Moses system (Koehn et al., 2007) with default settings was employed to perform machine translation, and BLEU (Papineni et al., 2002) was used to evaluate translation quality.
    Page 4, “Complexity Analysis”

BLEU scores

Appears in 3 sentences as: BLEU scores (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Improvement of BLEU scores compared to the supervised Stanford Chinese word segmenter.
    Page 2, “Introduction”
  2. It was set to 3 for the monolingual unigram model, and 2 for the bilingual unigram model, which provided slightly higher BLEU scores on the development set than the other settings.
    Page 5, “Complexity Analysis”
  3. Table 4 presents the BLEU scores for Moses using different segmentation methods.
    Page 5, “Complexity Analysis”

dynamic programming

Appears in 3 sentences as: dynamic programming (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. P(.7-",f/|.7:, M) is the marginal probability of all the possible F E .7: that contain .735, as a word, which can be calculated efficiently through dynamic programming (the process is similar to the foreward-backward algorithm in training a hidden Markov model (HMM) (Rabiner, 1989)):
    Page 2, “Methods”
  2. Then, the previous dynamic programming method can be extended to the bilingual expectation
    Page 3, “Methods”
  3. where $K$ is the number of characters in $\mathcal{F}$, and the $k$-th character is the start of the word $f_j$, since $j$ and $J$ are unknown during the computation of dynamic programming.
    Page 3, “Methods”
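
As a concrete illustration of item 1 above, here is a minimal sketch, assuming a plain unigram word model; the function and variable names (word_expectations, prob, max_len) and the toy numbers are our own, not taken from the authors' implementation. It computes, by a forward-backward style dynamic program, the marginal probability of each substring being used as a word, accumulated into expected counts over all segmentations of an unsegmented sentence.

```python
from collections import defaultdict


def word_expectations(chars, prob, max_len):
    """Expected count of each substring being used as a word, where the
    expectation is over all segmentations F of `chars`, each weighted by
    the product of its unigram word probabilities prob[word]."""
    n = len(chars)
    # forward[k]  = total probability of all segmentations of chars[:k]
    # backward[k] = total probability of all segmentations of chars[k:]
    forward = [0.0] * (n + 1)
    backward = [0.0] * (n + 1)
    forward[0] = 1.0
    backward[n] = 1.0

    for k in range(1, n + 1):                       # forward pass
        for u in range(1, min(max_len, k) + 1):
            forward[k] += forward[k - u] * prob.get(chars[k - u:k], 0.0)

    for k in range(n - 1, -1, -1):                  # backward pass
        for u in range(1, min(max_len, n - k) + 1):
            backward[k] += prob.get(chars[k:k + u], 0.0) * backward[k + u]

    expected = defaultdict(float)
    total = forward[n]                              # sum over all segmentations
    if total == 0.0:
        return expected
    for k in range(n):                              # marginal for every span
        for u in range(1, min(max_len, n - k) + 1):
            w = chars[k:k + u]
            p = prob.get(w, 0.0)
            if p > 0.0:
                # probability that the span [k, k+u) is realized as a word,
                # i.e. the marginal that item 1 computes by dynamic programming
                expected[w] += forward[k] * p * backward[k + u] / total
    return expected


if __name__ == "__main__":
    # Toy model with made-up probabilities, purely for illustration.
    toy_prob = {"a": 0.2, "b": 0.2, "ab": 0.1, "ba": 0.1}
    for word, count in sorted(word_expectations("abab", toy_prob, 2).items()):
        print(word, round(count, 4))
```

An EM trainer of the kind described under the "Gibbs sampling" topic below would accumulate these expected counts over the whole corpus in the E-step and renormalize them into word probabilities in the M-step.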

Gibbs sampling

Appears in 3 sentences as: Gibbs sampling (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. (2010) used the local best alignment to increase the speed of Gibbs sampling in training, but the impact on accuracy was not explored.
    Page 1, “Introduction”
  2. To this end, we model bilingual UWS under a framework similar to monolingual UWS in order to improve efficiency, and replace Gibbs sampling with expectation maximization (EM) in training (the two EM steps are sketched schematically after this list).
    Page 1, “Introduction”
  3. $E_{F/\{k'\}}\bigl(P(f_{k'} \mid \mathcal{F})\bigr) = P(f_{k'} \mid \mathcal{F}, \mathcal{M})$, in a similar manner to the marginalization in the Gibbs sampling process which we are replacing;
    Page 2, “Methods”
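
Schematically, the EM replacement described in items 2 and 3 consists of the two steps below; the count notation $c(w)$ and the unsmoothed M-step are our simplification for illustration, and the paper's actual update may differ:

E-step (expected count of each word type $w$ over all segmentations of each sentence):
\[
  c(w) \;=\; \sum_{\mathcal{F}} \sum_{k'} P\bigl(f_{k'} = w \mid \mathcal{F}, \mathcal{M}\bigr),
\]
M-step (re-estimate the unigram model from the expected counts):
\[
  P_{\mathcal{M}}(w) \;\leftarrow\; \frac{c(w)}{\sum_{w'} c(w')}.
\]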

NIST

Appears in 3 sentences as: NIST (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the NTCIR PatentMT corpus, which is out-of-domain.
    Page 1, “Abstract”
  2. The first bilingual corpus, OpenMT06, was used in the NIST Open Machine Translation 2006 Evaluation.
    Page 4, “Complexity Analysis”
  3. The data sets of NIST Eval 2002 to 2005 were used as the development data for MERT tuning (Och, 2003).
    Page 4, “Complexity Analysis”

segmentation model

Appears in 3 sentences as: segmentation model (3)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. $\mathcal{M}$: monolingual segmentation model
    Page 2, “Methods”
  2. $\mathcal{B}$: bilingual segmentation model
    Page 2, “Methods”
  3. segmentation model $\mathcal{M}$ or $\mathcal{B}$.
    Page 3, “Methods”

sentence pair

Appears in 3 sentences as: sentence pair (2) sentence pairs (1)
In Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
  1. (.7-",E) a bilingual sentence pair
    Page 2, “Methods”
  2. The computational complexity of our method is linear in the number of iterations, the size of the corpus, and the complexity of calculating the expectations on each sentence or sentence pair.
    Page 4, “Complexity Analysis”
  3. This was verified by experiments on a corpus of 1 million sentence pairs, on which traditional MCMC approaches would struggle (Xu et al., 2008).
    Page 5, “Complexity Analysis”
