An improved MDL-based compression algorithm for unsupervised word segmentation

We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression.

Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the advent of the hierarchical Dirichlet process (Goldwater et al., 2009) and adaptor grammars (Johnson and Goldwater, 2009).

The dynamics behind regularized compression are similar to those of digram coding (Witten et al., 1999).
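Digram coding greedily replaces the most frequent adjacent pair of symbols with a fresh symbol, one pass at a time. A minimal sketch of one such step (illustrative only, not the authors' implementation):

```python
from collections import Counter

def merge_most_frequent_bigram(seq, new_symbol):
    """One digram-coding step: replace every occurrence of the most
    frequent adjacent pair in `seq` with `new_symbol`."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq, None
    (x, y), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == x and seq[i + 1] == y:
            out.append(new_symbol)  # merge the pair into one symbol
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out, (x, y)

seq = list("abababcab")
merged, pair = merge_most_frequent_bigram(seq, "Z")
# ('a', 'b') is the most frequent pair; merging yields Z Z Z c Z
```

Regularized compression differs in *which* bigram it selects at each iteration, but the merge mechanics are the same.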

The second term of the aforementioned objective is in fact an approximation to the change in description length.

Based on this finding, we propose the following two variations (see Figure 1) for the regularized compression framework:


In this paper, we derive a new lower-bound approximation of the objective function used in the regularized compression algorithm.

Appears in 5 sentences as: bigram (4), bigrams (1)

In *An improved MDL-based compression algorithm for unsupervised word segmentation*

- Hence, a new sequence W_i is created in the i-th iteration by merging all the occurrences of some selected bigram (x, y) in the original sequence W_{i-1}. (Page 1, "Regularized Compression")
- Note that f(x, y) is the bigram frequency, |W_{i-1}| the sequence length of W_{i-1}, and ΔH(W_{i-1}, m) = H(W_i) − H(W_{i-1}) is the difference between the empirical Shannon entropy measured on W_i and W_{i-1}, using maximum likelihood estimates. (Page 1, "Regularized Compression")
- In the new sequence W_i, each occurrence of the x-y bigram is replaced with a new (conceptually unseen) word z. (Page 2, "Regularized Compression")
- Suppose that the original sequence W_{i-1} is N words long, the selected word types x and y occur k and l times, respectively, and altogether the x-y bigram occurs m times in W_{i-1}. (Page 2, "Change in Description Length")
- In the new sequence W_i, each of the m bigrams is replaced with an unseen word z = xy. (Page 2, "Change in Description Length")
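The entropy difference ΔH above compares the empirical Shannon entropy of W_i and W_{i-1} under maximum-likelihood (relative-frequency) estimates. A toy check of those quantities (the sequence and symbols are illustrative, not from the paper):

```python
from collections import Counter
from math import log2

def empirical_entropy(seq):
    """Empirical Shannon entropy H(W) in bits, with word probabilities
    estimated by relative frequency (maximum likelihood)."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

# Toy instance of the setup: W_{i-1} has N = 5 words, x = 'a' and y = 'b'
# each occur twice (k = l = 2), and the x-y bigram occurs m = 2 times.
w_prev = ["a", "b", "a", "b", "c"]
# Merging every x-y occurrence into the unseen word z = xy gives W_i,
# which is N - m = 3 words long.
w_next = ["z", "z", "c"]
delta_h = empirical_entropy(w_next) - empirical_entropy(w_prev)
```

In this toy case merging both shortens the sequence and lowers its entropy, so `delta_h` is negative.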

Appears in 5 sentences as: F-measure (5)

In *An improved MDL-based compression algorithm for unsupervised word segmentation*

- Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F). (Page 3, "Evaluation")
- We found that, in all three settings, G2 outperforms the baseline by 1 to 2 percentage points in F-measure. (Page 3, "Evaluation")
- The best performance result achieved by G2 in our experiment is 81.7 in word-level F-measure, although this was obtained from search setting (c), using a heuristic p value 0.37. (Page 3, "Evaluation")
- It would be interesting to confirm this by studying the correlation between description length and word-level F-measure. (Page 4, "Evaluation")
- Using MDL alone, one proposed method outperforms the original regularized compressor (Chen et al., 2012) in precision by 2 percentage points and in F-measure by 1. (Page 4, "Concluding Remarks")
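Word-level scoring conventionally counts a predicted word as correct only if both of its boundaries match the gold segmentation. A minimal sketch under that convention (the paper does not publish its scorer, so names and details here are illustrative):

```python
def word_spans(words):
    """Convert a segmentation (list of words) into character-offset spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def word_prf(gold, pred):
    """Word-level precision, recall, and F-measure: a predicted word is
    correct only if both its boundaries match the gold segmentation."""
    g, p = word_spans(gold), word_spans(pred)
    tp = len(g & p)
    prec, rec = tp / len(p), tp / len(g)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

# e.g. gold "the dog ran" vs. an under-segmented prediction "thedog ran"
p, r, f = word_prf(["the", "dog", "ran"], ["thedog", "ran"])
```

Only "ran" matches both boundaries here, giving P = 1/2, R = 1/3, F = 0.4.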

Appears in 3 sentences as: objective function (3)

In *An improved MDL-based compression algorithm for unsupervised word segmentation*

- Our analysis shows that its objective function can be efficiently approximated using the negative empirical pointwise mutual information. (Page 1, "Abstract")
- The new objective function is written out as Equation (4). (Page 2, "Proposed Method")
- In this paper, we derive a new lower-bound approximation of the objective function used in the regularized compression algorithm. (Page 4, "Concluding Remarks")
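Negative empirical pointwise mutual information under relative-frequency estimates can be sketched as below. The exact normalization used in the paper's Equation (4) may differ, so this is illustrative only:

```python
from collections import Counter
from math import log2

def neg_empirical_pmi(seq, x, y):
    """Negative empirical pointwise mutual information of the bigram
    (x, y), with all probabilities estimated by relative frequency."""
    n = len(seq)
    unigram = Counter(seq)
    bigram = Counter(zip(seq, seq[1:]))
    p_xy = bigram[(x, y)] / (n - 1)   # one bigram slot per adjacent pair
    return -log2(p_xy / ((unigram[x] / n) * (unigram[y] / n)))

# Strongly associated pairs get a large negative score, so minimizing
# this quantity favors merging them first.
score = neg_empirical_pmi(list("ababc"), "a", "b")
```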

Appears in 3 sentences as: word segmentation (3)

In *An improved MDL-based compression algorithm for unsupervised word segmentation*

- We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. (Page 1, "Abstract")
- Hierarchical Bayes methods have been mainstream in unsupervised word segmentation since the advent of the hierarchical Dirichlet process (Goldwater et al., 2009) and adaptor grammars (Johnson and Goldwater, 2009). (Page 1, "Introduction")
- A natural extension of this work is to reproduce this result on some other word segmentation benchmarks, specifically those in other Asian languages (Emerson, 2005; Zhikov et al., 2010). (Page 4, "Concluding Remarks")

Appears in 3 sentences as: word-level (3)

In *An improved MDL-based compression algorithm for unsupervised word segmentation*

- Segmentation performance is measured using word-level precision (P), recall (R), and F-measure (F). (Page 3, "Evaluation")
- The best performance result achieved by G2 in our experiment is 81.7 in word-level F-measure, although this was obtained from search setting (c), using a heuristic p value 0.37. (Page 3, "Evaluation")
- It would be interesting to confirm this by studying the correlation between description length and word-level F-measure. (Page 4, "Evaluation")
