Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.

“Word” is no trivial concept in many languages.

To compute a probability p(w|s) in (l), we adopt a Bayesian language model lately proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.

Thus far we have assumed that the unigram G1 is already given, but of course it should also be generated as G1 ~ PY(G0, d, 6).

To find the hidden word segmentation w of a string 3 = 01 - - - c N, which is equivalent to the vector of binary hidden variables 2 = 21 - - - ZN, the simplest approach is to build a Gibbs sampler that randomly selects a character c,- and draw a binary decision 2,- as to whether there is a word boundary, and then update the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).

To validate our model, we conducted experiments on standard datasets for Chinese and Japanese word segmentation that are publicly available, as well as the same dataset used in (Goldwater et al., 2006).

In retrospect, our NPYLM is essentially a hierarchical Markov model where the units (=words) evolve as the Markov process, and each unit has subunits (=characters) that also evolve as the Markov process.

In this paper, we proposed a much more efficient and accurate model for fully unsupervised word segmentation.

Appears in 25 sentences as: Word segmentation (1) word segmentation (24)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.Page 1, “Abstract”
- We confirmed that it significantly outperforms previous reported results in both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation .Page 1, “Abstract”
- Asian languages such as Chinese and Japanese have no explicit word boundaries, thus word segmentation is a crucial first step when processing them.Page 1, “Introduction”
- In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations.Page 1, “Introduction”
- This maximizes the probability of word segmentation w given a string 3 :Page 1, “Introduction”
- This approach often implicitly includes heuristic criteria proposed so farl, while having a clear statistical semantics to find the most probable word segmentation that will maximize the probability of the data, here the strings.Page 1, “Introduction”
- In this paper, we extend this work to propose a more efficient and accurate unsupervised word segmentation that will optimize the performance of the word n-gram Pitman-Yor (i.e.Page 1, “Introduction”
- By embedding a character n-gram in word n-gram from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation , which we call the Nested Pitman-Yor language model.Page 2, “Introduction”
- If a lexicon is finite, we can use a uniform prior G0(w) = l/|V| for every word 21) in lexicon V. However, with word segmentation every substring could be a word, thus the lexicon is not limited but will be countably infinite.Page 3, “Nested Pitman-Yor Language Model”
- Building an accurate G0 is crucial for word segmentation , since it determines how the possible words will look like.Page 3, “Nested Pitman-Yor Language Model”
- To find the hidden word segmentation w of a string 3 = 01 - - - c N, which is equivalent to the vector of binary hidden variables 2 = 21 - - - ZN, the simplest approach is to build a Gibbs sampler that randomly selects a character c,- and draw a binary decision 2,- as to whether there is a word boundary, and then update the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).Page 4, “Inference”

See all papers in *Proc. ACL 2009* that mention word segmentation.

See all papers in *Proc. ACL* that mention word segmentation.

Back to top.

Appears in 18 sentences as: Language Model (2) language model (19)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- Our model is a nested hierarchical Pitman-Yor language model , where Pitman-Yor spelling model is embedded in the word model.Page 1, “Abstract”
- Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “wor ” indications.Page 1, “Abstract”
- Bayesian Kneser—Ney) language model , with an accurate character oo-gram Pitman-Yor spelling model embedded in word models.Page 1, “Introduction”
- Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language.Page 1, “Introduction”
- we briefly describe a language model based on the Pitman-Yor process (Teh, 2006b), which is a generalization of the Dirichlet process used in previous research.Page 2, “Introduction”
- By embedding a character n-gram in word n-gram from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model .Page 2, “Introduction”
- To compute a probability p(w|s) in (l), we adopt a Bayesian language model lately proposed by (Teh, 2006b; Goldwater et al., 2005) based on the Pitman-Yor process, a generalization of the Dirichlet process.Page 2, “Pitman-Yor process and n-gram models”
- As a result, the n-gram probability of this hierarchical Pitman—Yor language model (HPYLM) is recursively computed asPage 3, “Pitman-Yor process and n-gram models”
- When we set thw E l, (4) recovers a Kneser-Ney smoothing: thus a HPYLM is a Bayesian Kneser—Ney language model as well as an extension of the hierarchical Dirichlet Process (HDP) used in Goldwater et al.Page 3, “Pitman-Yor process and n-gram models”
- In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM.Page 3, “Nested Pitman-Yor Language Model”
- Figure 2: Chinese restaurant representation of our Nested Pitman-Yor Language Model (NPYLM).Page 3, “Nested Pitman-Yor Language Model”

See all papers in *Proc. ACL 2009* that mention language model.

See all papers in *Proc. ACL* that mention language model.

Back to top.

Appears in 13 sentences as: n-gram (15)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “wor ” indications.Page 1, “Abstract”
- In this paper, we extend this work to propose a more efficient and accurate unsupervised word segmentation that will optimize the performance of the word n-gram Pitman-Yor (i.e.Page 1, “Introduction”
- Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language.Page 1, “Introduction”
- By embedding a character n-gram in word n-gram from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model.Page 2, “Introduction”
- In this representation, each n-gram context h (including the null context 6 for unigrams) is a Chinese restaurant whose customers are the n-gram counts seated over the tables 1 - - -thw.Page 2, “Pitman-Yor process and n-gram models”
- As a result, the n-gram probability of this hierarchical Pitman—Yor language model (HPYLM) is recursively computed asPage 3, “Pitman-Yor process and n-gram models”
- In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM.Page 3, “Nested Pitman-Yor Language Model”
- To avoid dependency on n-gram order n, we actually used the oo-gram language model (Mochihashi and Sumita, 2007), a variable order HPYLM, for characters.Page 3, “Nested Pitman-Yor Language Model”
- character n-gram model explicit as 9 we can setPage 5, “Inference”
- where p(cl---ck,k|@) is an n-gram probability given by (6), and p(l<:|@) is a probability that a word of length k will be generated from 9.Page 5, “Inference”
- NPY(n) means n-gram NPYLM.Page 6, “Experiments”

See all papers in *Proc. ACL 2009* that mention n-gram.

See all papers in *Proc. ACL* that mention n-gram.

Back to top.

Appears in 12 sentences as: Segmentations (1) segmentations (11)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- In order to extract “words” from text streams, unsupervised word segmentation is an important research area because the criteria for creating supervised training data could be arbitrary, and will be suboptimal for applications that rely on segmentations .Page 1, “Introduction”
- It is particularly difficult to create “correct” training data for speech transcripts, colloquial texts, and classics where segmentations are often ambiguous, let alone is impossible for unknown languages whose properties computational linguists might seek to uncover.Page 1, “Introduction”
- When we repeat this process, it is expected to mix rapidly because it implicitly considers all possible segmentations of the given string at the same time.Page 4, “Inference”
- Segmentations before the final k characters are marginalized using the following recursive relationship:Page 4, “Inference”
- Figure 4: Forward filtering of a[t] to marginalize out possible segmentations j before 75— k.Page 5, “Inference”
- Japanese word segmentation, with all supervised segmentations removed in advance.Page 6, “Experiments”
- Semi-supervised results used only 10K sentences (1/5) of supervised segmentations .Page 7, “Experiments”
- segmentations .Page 7, “Experiments”
- The result also shows that the supervised segmentations are suboptimal with respect to the perplexity per character, and even worse than unsupervised results.Page 7, “Experiments”
- In semi-supervised case, using only 10K reference segmentations gives a performance of around 90% accuracy and the lowest perplexity, thanks to a combination with unsupervised data in a principled fashion.Page 7, “Experiments”
- The inferred segmentations arePage 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention segmentations.

See all papers in *Proc. ACL* that mention segmentations.

Back to top.

Appears in 11 sentences as: Gibbs Sampler (1) Gibbs sampler (9) Gibbs sampling (1)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.Page 1, “Abstract”
- However, they are still na‘1've with respect to word spellings, and the inference is very slow owing to inefficient Gibbs sampling .Page 1, “Introduction”
- Section 4 describes an efficient blocked Gibbs sampler that leverages dynamic programming for inference.Page 2, “Introduction”
- To find the hidden word segmentation w of a string 3 = 01 - - - c N, which is equivalent to the vector of binary hidden variables 2 = 21 - - - ZN, the simplest approach is to build a Gibbs sampler that randomly selects a character c,- and draw a binary decision 2,- as to whether there is a word boundary, and then update the language model according to the new segmentation (Goldwater et al., 2006; Xu et al., 2008).Page 4, “Inference”
- 4.1 Blocked Gibbs samplerPage 4, “Inference”
- Instead, we propose a sentence-wise Gibbs sampler of word segmentation using efficient dynamic programming, as shown in Figure 3.Page 4, “Inference”
- Figure 3: Blocked Gibbs Sampler of NPYLM 9.Page 4, “Inference”
- This is called a blocked Gibbs sampler that samples z block-wise for each sentence.Page 4, “Inference”
- Since our algorithm converges rather fast, we ran the Gibbs sampler of trigram NPYLM for 200 iterations to obtain the results in Table 1.Page 6, “Experiments”
- In all cases we removed all whitespaces to yield raw character strings for inference, and set L = 4 for Chinese and L = 8 for Japanese to run the Gibbs sampler for 400 iterations.Page 6, “Experiments”
- 9Notice that analyzing a test data is not easy for character-wise Gibbs sampler of previous work.Page 6, “Experiments”

See all papers in *Proc. ACL 2009* that mention Gibbs sampler.

See all papers in *Proc. ACL* that mention Gibbs sampler.

Back to top.

Appears in 10 sentences as: Bigram (1) bigram (5) bigrams (4)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- Crucially, since they rely on sampling a word boundary between two neighboring words, they can leverage only up to bigram word dependencies.Page 1, “Introduction”
- The bigram distribution G2 = {Page 2, “Pitman-Yor process and n-gram models”
- Furthermore, it has an inherent limitation that it cannot deal with larger than bigrams , because it uses only local statistics between directly contiguous words for word segmentation.Page 4, “Inference”
- It has an additional advantage in that we can accommodate higher-order relationships than bigrams , particularly trigrams, for word segmentation.Page 4, “Inference”
- For this purpose, we maintain a forward variable a[t] in the bigram case.Page 4, “Inference”
- Figure 5: Forward-Backward sampling of word segmentation w. (in bigram case)Page 5, “Inference”
- For simplicity, we showed the algorithm for bigrams above.Page 5, “Inference”
- Complexity This algorithm has a complexity of 0(NL2) for bigrams and 0(NL3) for trigrams for each sentence, where N is the length of the sentence and L is the maximum allowed length of a word (3 N).Page 5, “Inference”
- Bigram and trigram performances are similar for Chinese, but trigram performs better for Japanese.Page 7, “Experiments”
- In fact, although the difference in perplexity per character is not so large, the perplexity per word is radically reduced: 439.8 ( bigram ) to 190.1 (trigram).Page 7, “Experiments”

See all papers in *Proc. ACL 2009* that mention bigram.

See all papers in *Proc. ACL* that mention bigram.

Back to top.

Appears in 7 sentences as: unigram (3) unigrams (4)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- Suppose we have a unigram word distribution G1 = } where - ranges over each word in the lexicon.Page 2, “Pitman-Yor process and n-gram models”
- In this representation, each n-gram context h (including the null context 6 for unigrams ) is a Chinese restaurant whose customers are the n-gram counts seated over the tables 1 - - -thw.Page 2, “Pitman-Yor process and n-gram models”
- Thus far we have assumed that the unigram G1 is already given, but of course it should also be generated as G1 ~ PY(G0, d, 6).Page 3, “Nested Pitman-Yor Language Model”
- 2Note that this is different from unigrams , which are posterior distribution given data.Page 3, “Nested Pitman-Yor Language Model”
- When a word 21) is generated from its parent at the unigram node, it means that wPage 3, “Nested Pitman-Yor Language Model”
- While previous work used p(l<:|@) = (l —p($))k_1p($), this is only true for unigrams .Page 5, “Inference”
- the number of tables tew for w in word unigrams .Page 6, “Inference”

See all papers in *Proc. ACL 2009* that mention unigrams.

See all papers in *Proc. ACL* that mention unigrams.

Back to top.

Appears in 6 sentences as: Semi-supervised (2) semi-supervised (4)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- In Section 5 we describe experiments on the standard datasets in Chinese and Japanese in addition to English phonetic transcripts, and semi-supervised experiments are also explored.Page 2, “Introduction”
- Table 4: Semi-supervised and supervised results.Page 7, “Experiments”
- Semi-supervised results used only 10K sentences (1/5) of supervised segmentations.Page 7, “Experiments”
- Furthermore, NPYLM is easily amenable to semi-supervised or even supervised learning.Page 7, “Experiments”
- In semi-supervised case, using only 10K reference segmentations gives a performance of around 90% accuracy and the lowest perplexity, thanks to a combination with unsupervised data in a principled fashion.Page 7, “Experiments”
- Therefore, it is also interesting to combine them in a discriminative way as per-sued in POS tagging using CRF+HMM (Suzuki et al., 2007), let alone a simple semi-supervised approach in Section 5.2.Page 7, “Discussion”

See all papers in *Proc. ACL 2009* that mention semi-supervised.

See all papers in *Proc. ACL* that mention semi-supervised.

Back to top.

Appears in 5 sentences as: dynamic programming (5)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference.Page 1, “Abstract”
- Section 4 describes an efficient blocked Gibbs sampler that leverages dynamic programming for inference.Page 2, “Introduction”
- To segment a string into “words”, we used efficient dynamic programming combined with MCMC, as described in the next section.Page 4, “Nested Pitman-Yor Language Model”
- Instead, we propose a sentence-wise Gibbs sampler of word segmentation using efficient dynamic programming , as shown in Figure 3.Page 4, “Inference”
- With a combination of dynamic programming and an accurate spelling model from a Bayesian perspective, our model significantly outperforms the previous reported results, and the inference is very efficient.Page 8, “Conclusion”

See all papers in *Proc. ACL 2009* that mention dynamic programming.

See all papers in *Proc. ACL* that mention dynamic programming.

Back to top.

Appears in 4 sentences as: hyperparameters (4)

In *Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling*

- 6, d are hyperparameters that can be learned as Gamma and Beta posteriors, respectively, given the data.Page 3, “Pitman-Yor process and n-gram models”
- 9 Sample hyperparameters of 9Page 4, “Inference”
- ba Na) to estimate A from the data for given language and word type.7 Here, l‘(:c) is a Gamma function and a, b are the hyperparameters chosen to give a nearly uniform prior distribution.8Page 5, “Inference”
- Since our implementation is based on Unicode and learns all hyperparameters from the data, we also confirmed that NPYLM segments the Arabic Gigawords equally well.Page 6, “Experiments”

See all papers in *Proc. ACL 2009* that mention hyperparameters.

See all papers in *Proc. ACL* that mention hyperparameters.

Back to top.