Conditional Random Fields for Word Hyphenation
Trogkanis, Nikolaos and Elkan, Charles

Article Structure

Abstract

Finding allowable places in words to insert hyphens is an important practical problem.

Introduction

The task that we investigate is learning to split words into parts that are conventionally agreed to be individual written units.

History of automated hyphenation

The earliest software for automatic hyphenation was implemented for RCA 301 computers, and used by the Palm Beach Post-Tribune and Los Angeles Times newspapers in 1962.

Conditional random fields

A linear-chain conditional random field (Lafferty et al., 2001) is a way to use a log-linear model for the sequence prediction task.
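As a reminder of the model class (notation is mine, not quoted from this summary), a linear-chain CRF defines a globally normalized log-linear distribution over tag sequences; for hyphenation, each tag $y_t \in \{0,1\}$ marks whether a hyphen is legal after letter $t$:

```latex
p(\bar{y} \mid \bar{x}) \;=\; \frac{1}{Z(\bar{x})}
  \exp \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \bar{x}, t),
\qquad
Z(\bar{x}) \;=\; \sum_{\bar{y}'} \exp \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \bar{x}, t).
```

Here the $f_k$ are feature functions over adjacent tags and the input word, and the weights $\lambda_k$ are learned from training data.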

Dataset creation

We start with the lexicon for English published by the Dutch Centre for Lexical Information at http://www.

Experimental design

We use tenfold cross validation for the experiments.
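Tenfold cross validation can be sketched as follows; the fold assignment below (shuffle once, deal round-robin) is an illustrative choice, not the paper's exact procedure:

```python
import random

def tenfold_splits(words, seed=0):
    """Partition a word list into 10 disjoint folds; each fold serves
    once as the test set while the other nine form the training set."""
    words = list(words)
    random.Random(seed).shuffle(words)
    folds = [words[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        train = [w for j, fold in enumerate(folds) if j != i for w in fold]
        yield train, test
```

Because every word appears in exactly one test fold, errors are always measured on words disjoint from the training data, which matters for a task where near-duplicate words share hyphenation patterns.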

Experimental results

In Table 2 and Table 3 we report the performance of the different methods on the English and Dutch datasets respectively.

Additional experiments

This section presents empirical results following two experimental designs that are less standard, but that may be more appropriate for the hyphenation task.

Timings

Table 7 shows the speed of the alternative methods for the English dataset.

Conclusions

Finding allowable places in words to insert hyphens is a real-world problem that is still not fully solved in practice.

Topics

error rate

Appears in 28 sentences as: error rate (24) error rates (8)
In Conditional Random Fields for Word Hyphenation
  1. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch.
    Page 1, “Abstract”
  2. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
    Page 1, “Abstract”
  3. The lowest per-letter test error rate reported is about 2%.
    Page 2, “History of automated hyphenation”
  4. In order to measure accuracy, we compute the confusion matrix for each method, and from this we compute error rates.
    Page 4, “Experimental design”
  5. We report both word-level and letter-level error rates.
    Page 4, “Experimental design”
  6. The word-level error rate is the fraction of words on which a method makes at least one mistake.
    Page 4, “Experimental design”
  7. The letter-level error rate is the fraction of letters for which the method predicts incorrectly whether or not a hyphen is legal after this letter.
    Page 4, “Experimental design”
  8. Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language.
    Page 5, “Experimental results”
  9. Figure 1 shows confidence intervals for the error rates.
    Page 5, “Experimental results”
  10. All differences between rows in Table 2 are significant, with one exception: the serious error rates for PATGEN and TALO are not statistically significantly different.
    Page 5, “Experimental results”
  11. For the English language, the CRF using the Viterbi path has overall error rate of 0.84%, compared to 6.81% for the TEX algorithm using American English patterns, which is eight times worse.
    Page 5, “Experimental results”
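The word-level and letter-level error rates defined in snippets 6 and 7 can be sketched directly; the 0/1-flag encoding of gold and predicted hyphen positions below is my assumption, not the paper's data format:

```python
def error_rates(gold, pred):
    """Letter-level and word-level error rates.

    gold and pred are parallel lists of words; each word is a list of
    0/1 flags, one per letter, with 1 meaning a hyphen is legal after
    that letter.  Letter-level rate: fraction of letters predicted
    incorrectly.  Word-level rate: fraction of words with >= 1 mistake."""
    letter_errors = letter_total = word_errors = 0
    for g, p in zip(gold, pred):
        mistakes = sum(gi != pi for gi, pi in zip(g, p))
        letter_errors += mistakes
        letter_total += len(g)
        word_errors += 1 if mistakes else 0
    return letter_errors / letter_total, word_errors / len(gold)

# Toy example: one wrong hyphen in the first word, second word perfect.
gold = [[0, 1, 0, 0, 0, 0], [0, 0, 1, 0]]
pred = [[0, 1, 0, 1, 0, 0], [0, 0, 1, 0]]
letter_er, word_er = error_rates(gold, pred)  # 1/10 letters, 1/2 words
```

Note the asymmetry the paper emphasizes: a wrongly allowed hyphen (false positive) is a serious error, while a missed hyphen is merely benign, so the confusion matrix is reported, not just these aggregate rates.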


CRF

Appears in 27 sentences as: CRF (25) CRF++ (4)
In Conditional Random Fields for Word Hyphenation
  1. Training a CRF means
    Page 3, “Conditional random fields”
  2. The software we use as an implementation of conditional random fields is named CRF++ (Kudo, 2007).
    Page 3, “Conditional random fields”
  3. We adopt the default parameter settings of CRF++, so no development set or tuning set is needed in our work.
    Page 3, “Conditional random fields”
  4. The standard Viterbi algorithm for making predictions from a trained CRF is not tuned to minimize false positives.
    Page 4, “Conditional random fields”
  5. These are learning experiments, so we also use tenfold cross validation in the same way as with CRF++.
    Page 5, “Experimental design”
  6. Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language.
    Page 5, “Experimental results”
  7. For the English language, the CRF using the Viterbi path has overall error rate of 0.84%, compared to 6.81% for the TEX algorithm using American English patterns, which is eight times worse.
    Page 5, “Experimental results”
  8. However, the serious error rate for the CRF is worse: 0.41% compared to 0.24%.
    Page 5, “Experimental results”
  9. Figure 1 shows that the CRF can use a probability threshold up to 0.99, and still have lower overall error rate than the TEX algorithm.
    Page 5, “Experimental results”
  10. Fixing the probability threshold at 0.99, the CRF serious error rate is 0.04% (224 false positives) compared to 0.24% (1343 false positives) for the TEX algorithm.
    Page 5, “Experimental results”
  11. For the English language, TALO yields overall error rate 1.99% with serious error rate 0.72%, so the standard CRF using the Viterbi path is better on both measures.
    Page 7, “Experimental results”
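Snippets 4, 9, and 10 above describe replacing the Viterbi path with a per-letter probability threshold to suppress false positives. A minimal sketch of that decision rule (the marginal probabilities are assumed to come from a trained CRF; the values below are made up):

```python
def threshold_decode(hyphen_marginals, threshold=0.99):
    """Insert a hyphen only where the model's marginal probability of
    'hyphen legal here' is at least `threshold`.  Raising the threshold
    trades missed hyphens (benign errors) for fewer wrongly allowed
    hyphens (serious errors)."""
    return [[1 if p >= threshold else 0 for p in word]
            for word in hyphen_marginals]

marginals = [[0.02, 0.995, 0.60, 0.01]]  # hypothetical per-letter P(hyphen)
decisions = threshold_decode(marginals)   # only the 0.995 position fires
```

This is why the snippets report a threshold of 0.99: the curve in Figure 1 shows the CRF can be made this conservative and still beat the TEX algorithm on overall error rate.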


word-level

Appears in 5 sentences as: word-level (5)
In Conditional Random Fields for Word Hyphenation
  1. The accuracy we achieve is slightly higher: word-level accuracy of 96.33% compared to their
    Page 2, “History of automated hyphenation”
  2. We report both word-level and letter-level error rates.
    Page 4, “Experimental design”
  3. The word-level error rate is the fraction of words on which a method makes at least one mistake.
    Page 4, “Experimental design”
  4. Specifically, for English our word-level accuracy (“ower”) is 96.33% while their best (“WA”) is 95.65%.
    Page 5, “Experimental design”
  5. For both languages, PATGEN has higher serious letter-level and word-level error rates than TEX using the existing pattern files.
    Page 7, “Experimental results”


CRFs

Appears in 4 sentences as: CRFs (4)
In Conditional Random Fields for Word Hyphenation
  1. Research on structured learning has been highly successful, with sequence classification as its most important and successful subfield, and with conditional random fields (CRFs) as the most influential approach to learning sequence classifiers.
    Page 1, “Introduction”
  2. we show that CRFs can achieve extremely good performance on the hyphenation task.
    Page 2, “Introduction”
  3. application of CRFs, which are a major advance of recent years in machine learning.
    Page 8, “Conclusions”
  4. A third contribution of our work is a demonstration that current CRF methods can be used straightforwardly for an important application and outperform state-of-the-art commercial and open-source software; we hope that this demonstration accelerates the widespread use of CRFs.
    Page 8, “Conclusions”


Viterbi

Appears in 4 sentences as: Viterbi (4)
In Conditional Random Fields for Word Hyphenation
  1. The standard Viterbi algorithm for making predictions from a trained CRF is not tuned to minimize false positives.
    Page 4, “Conditional random fields”
  2. For the English language, the CRF using the Viterbi path has overall error rate of 0.84%, compared to 6.81% for the TEX algorithm using American English patterns, which is eight times worse.
    Page 5, “Experimental results”
  3. For the English language, TALO yields overall error rate 1.99% with serious error rate 0.72%, so the standard CRF using the Viterbi path is better on both measures.
    Page 7, “Experimental results”
  4. For the Dutch language, the standard CRF using the Viterbi path has overall error rate 0.08%, compared to 0.81% for the TEX algorithm.
    Page 7, “Experimental results”


cross validation

Appears in 3 sentences as: cross validation (3)
In Conditional Random Fields for Word Hyphenation
  1. We use tenfold cross validation for the experiments.
    Page 4, “Experimental design”
  2. These are learning experiments, so we also use tenfold cross validation in the same way as with CRF++.
    Page 5, “Experimental design”
  3. Because cross validation is applied, errors are always measured on testing subsets that are disjoint from the corresponding training subsets.
    Page 7, “Additional experiments”


machine learning

Appears in 3 sentences as: machine learning (3)
In Conditional Random Fields for Word Hyphenation
  1. Over the years, various machine learning methods have been applied to the hyphenation task.
    Page 2, “History of automated hyphenation”
  2. As is the case with many machine learning methods, no strong guidance is available for choosing values for these parameters.
    Page 5, “Experimental design”
  3. application of CRFs, which are a major advance of recent years in machine learning.
    Page 8, “Conclusions”
