An improved MDL-based compression algorithm for unsupervised word segmentation
Chen, Ruey-Cheng

Article Structure

Abstract

We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression.

Introduction

Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the dawn of hierarchical Dirichlet process (Goldwater et al., 2009) and adaptors grammar (Johnson and Goldwater, 2009).

Regularized Compression

The dynamics behind regularized compression is similar to digram coding (Witten et al., 1999).

Change in Description Length

The second term of the aforementioned objective is in fact an approximate to the change in description length.

Proposed Method

Based on this finding, we propose the following two variations (see Figure l) for the regularized compression framework:

Evaluation

5.1 Setup

Concluding Remarks

In this paper, we derive a new lower-bound approximate to the objective function used in the regularized compression algorithm.

Topics

bigram

Appears in 5 sentences as: bigram (4) bigrams (1)
In An improved MDL-based compression algorithm for unsupervised word segmentation
  1. Hence, a new sequence IV,-is created in the i-th iteration by merging all the occurrences of some selected bigram (cc, 3/) in the original sequence Wi_1.
    Page 1, “Regularized Compression”
  2. Note that f(:c, y) is the bigram frequency, |Wi_1| the sequence length of VVi_1, and AH(W,-_1, m) = FI(W,-) — H(W,-_1) is the difference between the empirical Shannon entropy measured on 1% and VVi_1, using maximum likelihood estimates.
    Page 1, “Regularized Compression”
  3. In the new sequence 1%, each occurrence of the 53-3; bigram is replaced with a new (conceptually unseen) word 2.
    Page 2, “Regularized Compression”
  4. Suppose that the original sequence VVz-_1 is N -word long, the selected word type pair cc and 3/ each occurs k and 1 times, respectively, and altogether x-y bigram occurs m times in VVz-_1.
    Page 2, “Change in Description Length”
  5. In the new sequence Wi, each of the m bigrams is replaced with an unseen word 2 = my.
    Page 2, “Change in Description Length”

See all papers in Proc. ACL 2013 that mention bigram.

See all papers in Proc. ACL that mention bigram.

Back to top.

F-measure

Appears in 5 sentences as: F-measure (5)
In An improved MDL-based compression algorithm for unsupervised word segmentation
  1. Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F).
    Page 3, “Evaluation”
  2. We found that, in all three settings, G2 outperforms the baseline by 1 to 2 percentage points in F-measure .
    Page 3, “Evaluation”
  3. The best performance result achieved by G2 in our experiment is 81.7 in word-level F-measure , although this was obtained from search setting (c), using a heuristic p value 0.37.
    Page 3, “Evaluation”
  4. It would be interesting to confirm this by studying the correlation between description length and word-level F-measure .
    Page 4, “Evaluation”
  5. Using MDL alone, one proposed method outperforms the original regularized compressor (Chen et al., 2012) in precision by 2 percentage points and in F-measure by 1.
    Page 4, “Concluding Remarks”

See all papers in Proc. ACL 2013 that mention F-measure.

See all papers in Proc. ACL that mention F-measure.

Back to top.

objective function

Appears in 3 sentences as: objective function (3)
In An improved MDL-based compression algorithm for unsupervised word segmentation
  1. Our analysis shows that its objective function can be efficiently approximated using the negative empirical pointwise mutual information.
    Page 1, “Abstract”
  2. The new objective function is written out as Equation (4).
    Page 2, “Proposed Method”
  3. In this paper, we derive a new lower-bound approximate to the objective function used in the regularized compression algorithm.
    Page 4, “Concluding Remarks”

See all papers in Proc. ACL 2013 that mention objective function.

See all papers in Proc. ACL that mention objective function.

Back to top.

word segmentation

Appears in 3 sentences as: word segmentation (3)
In An improved MDL-based compression algorithm for unsupervised word segmentation
  1. We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression.
    Page 1, “Abstract”
  2. Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the dawn of hierarchical Dirichlet process (Goldwater et al., 2009) and adaptors grammar (Johnson and Goldwater, 2009).
    Page 1, “Introduction”
  3. A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010).
    Page 4, “Concluding Remarks”

See all papers in Proc. ACL 2013 that mention word segmentation.

See all papers in Proc. ACL that mention word segmentation.

Back to top.

word-level

Appears in 3 sentences as: word-level (3)
In An improved MDL-based compression algorithm for unsupervised word segmentation
  1. Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F).
    Page 3, “Evaluation”
  2. The best performance result achieved by G2 in our experiment is 81.7 in word-level F-measure, although this was obtained from search setting (c), using a heuristic p value 0.37.
    Page 3, “Evaluation”
  3. It would be interesting to confirm this by studying the correlation between description length and word-level F-measure.
    Page 4, “Evaluation”

See all papers in Proc. ACL 2013 that mention word-level.

See all papers in Proc. ACL that mention word-level.

Back to top.