Generalized Character-Level Spelling Error Correction
Farra, Noura and Tomeh, Nadi and Rozovskaya, Alla and Habash, Nizar

Article Structure


We present a generalized discriminative model for spelling error correction which targets character-level transformations.


Spelling error correction is a longstanding Natural Language Processing (NLP) problem, and it has recently become especially relevant because of the many potential applications to the large amount of informal and unedited text generated online, including web forums, tweets, blogs, and email.

Related Work

Most earlier work on automatic error correction addressed spelling errors in English and built models of correct usage on native English data (Kukich, 1992; Golding and Roth, 1999; Carlson and Fette, 2007; Banko and Brill, 2001).

The GSEC Approach

3.1 Modeling Spelling Correction at the Character Level


4.1 Model Evaluation


We showed that a generalized character-level spelling error correction model can improve spelling error correction on Egyptian Arabic data.



  1. While operating at the character level, the model makes use of word-level and contextual information.
  2. Discriminative models have been proposed at the word level for error correction (Duan et al., 2012) and for error detection (Habash and Roth, 2011).
  3. We implemented another approach for error correction based on a word-level maximum likelihood model.
  4. The word error rate (WER) metric is computed by summing the total number of word-level substitution errors, insertion errors, and deletion errors in the output, and dividing by the number of words in the reference.
  5. In the future, we plan to extend the model to use word-level language models to select between top character predictions in the output.
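
The WER definition in item 4 can be illustrated with a minimal sketch. It assumes that the substitution, insertion, and deletion counts come from a minimum word-level edit distance alignment between each output sentence and its reference, which is the usual convention but is not spelled out in the text. The function names `word_edit_distance` and `corpus_wer` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference word sequence into the hypothesis."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions against an empty hypothesis prefix
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Total word-level errors divided by total reference words.

    Assumption: sentences are whitespace-tokenized and the two lists
    are aligned one-to-one.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_errors = 0
    total_ref_words = 0
    for ref_line, hyp_line in zip(references, hypotheses):
        ref_words = ref_line.split()
        total_errors += word_edit_distance(ref_words, hyp_line.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    return total_errors / total_ref_words
```

For example, `corpus_wer(["a b c d"], ["a x c"])` counts one substitution and one deletion against four reference words. Because errors and reference words are summed over the whole corpus before dividing, long sentences weigh more than short ones.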
statistically significant

  1. Rows marked with an asterisk (*) are statistically significant compared to CEC (for the first half of the table) or CEC+MLE (for the second half of the table), with p < 0.05.
  2. In fact, CEC+MLE and GSEC+MLE perform similarly (p = 0.36, not statistically significant).
  3. The two results are statistically significant (p < 0.0001) with respect to CEC and CEC+MLE, respectively.
