Text Segmentation by Language Using Minimum Description Length
Yamaguchi, Hiroshi and Tanaka-Ishii, Kumiko

Article Structure

Abstract

The problem addressed in this paper is to segment a given multilingual document into seg-nunusforeachlanguageandthenidenfifythe language of each segment.

Introduction

For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages.

Problem Formulation

In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification.

Calculation of Cross-Entropy

The first term of (3), — log2 PLZ.

Segmentation by Dynamic Programming

By applying the above methods, we propose a solution to formula (1) through dynamic programming.

In the experiments reported here, n is set to 5 throughout.

Considering the additive characteristic of the description length formulated previously as formula (1), we denote the minimized description length for a given text X simply as DP(X), which can be decomposed recursively as follows6:

Experimental Results

6.1 Language Identification Performance

Conclusion

This article has presented a method for segmenting a multilingual text into segments, each in a different language.

Topics

language model

Appears in 9 sentences as: language model (7) language modeling (3)
In Text Segmentation by Language Using Minimum Description Length
  1. They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts.
    Page 1, “Introduction”
  2. In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification.
    Page 2, “Problem Formulation”
  3. calculates the description length of a text segment X,- through the use of a language model for Li.
    Page 2, “Problem Formulation”
  4. Here, the first term corresponds to the code length of the text chunk X,- given a language model for L,, which in fact corresponds to the cross—entropy of X,- for L,- multiplied by The remaining terms give the code lengths of the parameters used to describe the length of the first term: the second term corresponds to the segment location; the third term, to the identified language; and the fourth term, to the language model of language Li.
    Page 3, “Problem Formulation”
  5. This fourth term will differ according to the language model type; moreover, its value can be further minimized through formula (2).
    Page 3, “Problem Formulation”
  6. (XZ-), is the cross—entropy of X7; for L,- multiplied by Various methods for computing cross—entropy have been proposed, and these can be roughly classified into two types based on different methods of universal coding and the language model .
    Page 3, “Calculation of Cross-Entropy”
  7. For example, (Benedetto et al., 2002) and (Cilibrasi and Vitanyi, 2005) used the universal coding approach, whereas (Teahan and Harper, 2001) and (Sibun and Reynar, 1996) were based on language modeling using PPM and Kullback—Leibler divergence, respectively.
    Page 3, “Calculation of Cross-Entropy”
  8. As a representative method for calculating the cross—entropy through statistical language modeling , we adopt prediction by partial matching (PPM), a language—based encoding method devised by (Cleary and Witten, 1984).
    Page 4, “Calculation of Cross-Entropy”
  9. lel ), gives the description length of the remaining characters under the language model for L.
    Page 4, “In the experiments reported here, n is set to 5 throughout.”

See all papers in Proc. ACL 2012 that mention language model.

See all papers in Proc. ACL that mention language model.

Back to top.

dynamic programming

Appears in 5 sentences as: dynamic programming (5)
In Text Segmentation by Language Using Minimum Description Length
  1. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming .
    Page 1, “Abstract”
  2. Nevertheless, since we use a uniform amount of training data for every language, and since varying 7 would prevent us from improving the efficiency of dynamic programming , as explained in §4, in this article we set 7 to a constant obtained empirically.
    Page 3, “Problem Formulation”
  3. By applying the above methods, we propose a solution to formula (1) through dynamic programming .
    Page 4, “Segmentation by Dynamic Programming”
  4. We can straightforwardly implement this recur—sive computation through dynamic programming , by managing a table of size |X | x To fill a cell of this table, formula (4) suggests referring to t x |£| cells and calculating the description length of the rest of the text for O( |X | —t) cells for each language.
    Page 4, “In the experiments reported here, n is set to 5 throughout.”
  5. An actual procedure for obtaining an optimal result through dynamic programming was proposed.
    Page 8, “Conclusion”

See all papers in Proc. ACL 2012 that mention dynamic programming.

See all papers in Proc. ACL that mention dynamic programming.

Back to top.

gold standard

Appears in 3 sentences as: gold standard (3)
In Text Segmentation by Language Using Minimum Description Length
  1. In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state—of—the—art NLP applications are built by learning a gold standard for one specific language.
    Page 1, “Introduction”
  2. For this purpose, he used a gold standard of multilingual texts annotated by borders and languages.
    Page 1, “Introduction”
  3. Although the problem setting is similar to ours, the formulation and solution are different, particularly in that our method uses only a monolingual gold standard , not a multilingual one as in Teahan’s study.
    Page 1, “Introduction”

See all papers in Proc. ACL 2012 that mention gold standard.

See all papers in Proc. ACL that mention gold standard.

Back to top.

optimization problem

Appears in 3 sentences as: optimization problem (3)
In Text Segmentation by Language Using Minimum Description Length
  1. This article presents one way to formulate the segmentation and identification problem as a combinatorial optimization problem ; specifically, to find the set of segments and their languages that minimizes the description length of a given multilingual text.
    Page 2, “Introduction”
  2. In this section, we briefly introduce two methods previously studied by (Juola, 1997) and (Teahan, 2000) as representative of the two types, and we further explain a modification that we integrate into the final optimization problem .
    Page 3, “Calculation of Cross-Entropy”
  3. The segmentation task was modeled as an optimization problem of finding the best segment and language sequences to minimize the description length of a given text.
    Page 8, “Conclusion”

See all papers in Proc. ACL 2012 that mention optimization problem.

See all papers in Proc. ACL that mention optimization problem.

Back to top.