Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
Habash, Nizar and Roth, Ryan

Article Structure

Abstract

Arabic handwriting recognition (HR) is a challenging problem due to Arabic’s connected letter forms, consonantal diacritics and rich morphology.

Introduction

After years of development, optical character recognition (OCR) for Latin-character languages, such as English, has been refined greatly.

Arabic Handwriting Recognition Challenges

Arabic has several orthographic and morphological properties that make HR challenging (Darwish and Oard, 2002; Magdy and Darwish, 2006; Märgner and Abed, 2009).

Problem Zones in Handwriting Recognition

3.1 HR Error Classifications

Experimental Settings

4.1 Training and Evaluation Data

Results

Next, we describe the different experiments conducted by varying the features used in the PZD model.

Related Work

Common OCR/HR postprocessing strategies are similar to spelling correction solutions involving dictionary lookup (Kukich, 1992; Jurafsky and Martin, 2000) and morphological restrictions (Domeij et al., 1994; Oflazer, 1996).

Conclusions and Future Work

We presented a study with various settings (linguistic and nonlinguistic features and learning curve) for automatically detecting problem words in Arabic handwriting recognition.

Topics

F-score

Appears in 13 sentences as: F-score (14)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. Our best approach achieves a roughly 15% absolute increase in F-score over a simple but reasonable baseline.
    Page 1, “Abstract”
  2. We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall.
    Page 5, “Results”
  3. Feature Set            F-score   %Imp
     word                   43.85     —
     word+nw                43.86     0.0
     word+na                44.78     2.1
     word+lem               45.85     4.6
     word+pos               45.91     4.7
     word+nw+pos+lem+na     46.34     5.7
    Page 5, “Results”
  4. Feature Set   F-score   %Imp
     word          43.85     —
    Page 5, “Results”
  5. Base Feature Set      F-score   +conf    %Imp
     word                  43.85     55.83    27.3
     +nw N-grams           59.33     61.71    4.0
     +lem                  60.92     62.60    2.8
     +lem+na               60.47     63.14    4.4
     +lem+lem N-grams      60.44     62.88    4.0
     +pos+pos N-grams      …
    Page 6, “Results”
  6. Feature Set   F-score (S=2000)   F-score (S=4000)   %Imp
     word          43.85              52.08              18.8
     word+conf     55.83              57.50              3.0
    Page 6, “Results”
  7. For convenience, in the next section we refer to the third model listed in Table 6 as the best system (because it has the highest absolute F-score on the large data set), but readers should recall that these four models are roughly equivalent in performance.
    Page 6, “Results”
  8. (a)           S=4000                             S=2000
                   word    wconf   best    all        all
     Precision     54.7    59.5    67.1    67.4       62.4
     Recall        49.7    55.7    65.6    64.0       62.5
     F-score       52.1    57.5    66.3    65.6       62.4
     Accuracy      76.4    78.7    82.8    82.7       80.6
    Page 7, “Results”
  9. We consider the performance in terms of precision and recall in addition to F-score — see Table 7 (a).
    Page 7, “Results”
  10.               word     wconf    best     all
      Precision     37.55    51.48    57.01    55.46
      Recall        51.73    53.39    61.97    60.44
      F-score       43.51    52.42    59.39    57.84
      Accuracy      65.13    74.83    77.99    77.13
    Page 8, “Results”
  11. The best F-score combination was with n = 2 (any two agree), producing 62.8/72.4/67.3, almost 1% higher than our best system (see the sketch after this list).
    Page 8, “Results”
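
The precision, recall, and F-score figures in items 8-10, and the "any n systems agree" combination in item 11, can be reproduced with a few lines of code. The sketch below is ours, not the authors' implementation: the function names and the toy per-word labels are invented for illustration. Since F = 2PR/(P+R), the last value of the 62.8/72.4/67.3 triple in item 11 is the F-score of the first two.

# Minimal sketch (not the authors' code): word-level precision, recall,
# F-score, and an "any n of m systems agree" combination for problem-zone
# detection. Labels are 1 (problem word) and 0 (correct word).

def precision_recall_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def combine_any_n(predictions, n=2):
    """Flag a word as a problem if at least n of the systems flag it."""
    return [1 if sum(word_preds) >= n else 0 for word_preds in zip(*predictions)]

# Toy example with three hypothetical system outputs over six words.
gold  = [1, 0, 1, 1, 0, 0]
sys_a = [1, 0, 1, 0, 0, 1]
sys_b = [1, 0, 0, 1, 0, 0]
sys_c = [0, 0, 1, 1, 1, 0]
combined = combine_any_n([sys_a, sys_b, sys_c], n=2)
print(precision_recall_f1(gold, combined))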


Feature Set

Appears in 9 sentences as: Feature Set (5) feature set (2) feature sets (2)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. 5.1 Effect of Feature Set Choice
    Page 5, “Results”
  2. Table 3 illustrates the result of taking a baseline feature set (containing word as the only feature) and adding a single feature from the Simple set to it (a small feature-assembly sketch follows this list).
    Page 5, “Results”
  3. Feature Set            F-score   %Imp
     word                   43.85     —
     word+nw                43.86     0.0
     word+na                44.78     2.1
     word+lem               45.85     4.6
     word+pos               45.91     4.7
     word+nw+pos+lem+na     46.34     5.7
    Page 5, “Results”
  4. Feature Set   F-score   %Imp
     word          43.85     —
    Page 5, “Results”
  5. Base Feature Set      F-score   +conf    %Imp
     word                  43.85     55.83    27.3
     +nw N-grams           59.33     61.71    4.0
     +lem                  60.92     62.60    2.8
     +lem+na               60.47     63.14    4.4
     +lem+lem N-grams      60.44     62.88    4.0
     +pos+pos N-grams      …
    Page 6, “Results”
  6. Table 5: PZD F-scores for models when word confidence is added to the feature set.
    Page 6, “Results”
  7. In Table 5, we show the effect of adding conf as a feature to several base feature sets taken from Table 4.
    Page 6, “Results”
  8. Feature Set   F-score (S=2000)   F-score (S=4000)   %Imp
     word          43.85              52.08              18.8
     word+conf     55.83              57.50              3.0
    Page 6, “Results”
  9. We test this assumption by taking the best-performing feature sets from Table 5 and training new models using twice the training data (S=4000).
    Page 6, “Results”
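
Items 2-5 above walk through adding single features (nw, na, lem, pos) to a word-only baseline. The sketch below illustrates one way such per-word feature sets could be assembled before classification; the analyze() helper and its outputs are hypothetical stand-ins for a real Arabic morphological analyzer, not the paper's actual tooling.

# Sketch only: assembling per-word feature sets such as "word", "word+lem",
# or "word+nw+pos+lem+na" for a problem-zone classifier. The analyze()
# helper is a hypothetical placeholder, not a real analyzer interface.

def analyze(word):
    # Placeholder analysis; in practice these values would come from an
    # Arabic morphological analyzer.
    return {"nw": word.replace("ى", "ي"),   # toy Ya normalization
            "lem": word,                     # toy lemma
            "pos": "noun",                   # toy POS tag
            "na": "1"}                       # toy number-of-analyses value

def build_features(word, feature_set=("word",)):
    analysis = analyze(word)
    feats = {}
    if "word" in feature_set:
        feats["word"] = word
    for name in ("nw", "lem", "pos", "na"):
        if name in feature_set:
            feats[name] = analysis[name]
    return feats

# "word" baseline vs. the richer "word+nw+pos+lem+na" configuration.
print(build_features("كتاب", ("word",)))
print(build_features("كتاب", ("word", "nw", "pos", "lem", "na")))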


N-grams

Appears in 9 sentences as: N-grams (8) N-grams, (1)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. nw N-grams    Normword 1/2/3-gram probabilities
     lem N-grams   Lemma 1/2/3-gram probabilities
     pos N-grams   POS 1/2/3-gram probabilities
    Page 4, “Experimental Settings”
  2. served for word N-grams which did not appear in the models).
    Page 5, “Experimental Settings”
  3. Like the N-grams, this number is binned; in this case there are 11 bins, with 10 spread evenly over the [0,1) range, and an extra bin for values of exactly 1 (i.e., when the word appears in every hypothesis in the set); a binning sketch follows this list.
    Page 5, “Experimental Settings”
  4. First, Table 4 shows the effect of adding nw N-grams of successively higher orders to the word baseline.
    Page 5, “Results”
  5. Feature Set                F-score   %Imp
     word+nw 1-gram             49.51     12.9
     word+nw 1-gram+nw 2-gram   59.26     35.2
     word+nw N-grams            59.33     35.3
     +pos                       58.50     33.4
     +pos N-grams               57.35     30.8
     +lem+lem N-grams           59.63     36.0
     +lem+lem N-grams+na        59.93     36.7
     +lem+lem N-grams+na+nw     59.77     36.3
     +lem                       60.92     38.9
     +lem+na                    60.47     37.9
     +lem+lem N-grams           60.44     37.9
    Page 5, “Results”
  6. Here, the best performer is the model which utilizes the word, nw N-grams,
    Page 5, “Results”
  7. Base Feature Set      F-score   +conf    %Imp
     word                  43.85     55.83    27.3
     +nw N-grams           59.33     61.71    4.0
     +lem                  60.92     62.60    2.8
     +lem+na               60.47     63.14    4.4
     +lem+lem N-grams      60.44     62.88    4.0
     +pos+pos N-grams      …
    Page 6, “Results”
  8. The label "N-grams" following a Binned feature refers to using 1, 2 and 3-grams of that feature.
    Page 6, “Results”
  9. The label "N-grams" following a Binned feature refers to using 1, 2 and 3-grams of that feature.
    Page 6, “Results”
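
Item 3 above describes how a numeric value such as word confidence is reduced to one of 11 bin labels: 10 bins spread evenly over [0,1) plus an extra bin for values of exactly 1. The sketch below is our reading of that description; the label strings are invented for illustration.

# Sketch of the binning described above: a numeric feature in [0, 1] is
# mapped to one of 11 labels -- 10 even bins over [0, 1) plus a separate
# bin for values of exactly 1. The label strings are illustrative only.

def bin_label(value, n_bins=10):
    if value >= 1.0:
        return "BIN_EXACT_1"          # e.g. word appears in every hypothesis
    index = int(value * n_bins)       # 0.0-0.099.. -> 0, ..., 0.9-0.99.. -> 9
    return f"BIN_{index}"

for v in (0.0, 0.05, 0.37, 0.95, 1.0):
    print(v, bin_label(v))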


statistically significant

Appears in 6 sentences as: statistically significant (6)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. However, the differences between this model and the other models using lem in Table 4 are not statistically significant.
    Page 6, “Results”
  2. The differences between this model and the other lower performing models are statistically significant (p<0.05).
    Page 6, “Results”
  3. The differences among the last four models (all including lem) in Table 5 are not statistically significant .
    Page 6, “Results”
  4. The differences between these four models and the first two are statistically significant (p<0.05).
    Page 6, “Results”
  5. We note that the value of doubling S is roughly 3-6 times greater for the word baseline than the others; however, simply adding conf to the baseline provides an even greater improvement than doubling S. The differences between the final four models in Table 6 are not statistically significant.
    Page 6, “Results”
  6. The differences between these models and the first two models in the table are statistically significant (p<0.05).
    Page 6, “Results”


language model

Appears in 5 sentences as: language model (3) Language Modeling (1) language models (2)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. Digits, on the other hand, are a hard class to language model since the vocabulary (of multi-digit numbers) is infinite.
    Page 3, “Problem Zones in Handwriting Recognition”
  2. The HR system output does not contain any illegal non-words since its vocabulary is restricted by its training data and language models .
    Page 3, “Problem Zones in Handwriting Recognition”
  3. The models are built using the SRI Language Modeling Toolkit (Stolcke, 2002); a toy N-gram sketch follows this list.
    Page 4, “Experimental Settings”
  4. Alternatively, morphological information can be used to construct supplemental lexicons or language models (Sari and Sellami, 2002; Magdy and Darwish, 2006).
    Page 8, “Related Work”
  5. Their hypothesis that their large language model (16M words) may be responsible for why the word-based models outperformed stem-based (morphological) models is challenged by the fact that our language model data (220M words) is an order of magnitude larger, but we are still able to show benefit for using morphology.
    Page 9, “Related Work”
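
Item 3 above notes that the N-gram models are built with the SRI Language Modeling Toolkit. As a rough, language-agnostic stand-in, the sketch below estimates unsmoothed maximum-likelihood trigram probabilities from a token stream (the tokens could equally be normalized words, lemmas, or POS tags); a real setup would use SRILM-built models with smoothing and backoff.

# Toy stand-in for the N-gram language models described above: unsmoothed
# maximum-likelihood trigram probabilities over a token stream. This is not
# SRILM; it only illustrates the kind of probability used as a feature.
from collections import Counter

def train_trigram(tokens):
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def trigram_prob(trigrams, bigrams, w1, w2, w3):
    history = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / history if history else 0.0

tokens = "the cat sat on the mat the cat ran".split()
tri, bi = train_trigram(tokens)
print(trigram_prob(tri, bi, "the", "cat", "sat"))  # 0.5 in this toy corpus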


precision and recall

Appears in 4 sentences as: precision and recall (4)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. We present the results in terms of F-score only for simplicity; we then conduct an error analysis that examines precision and recall .
    Page 5, “Results”
  2. We consider the performance in terms of precision and recall in addition to F-score — see Table 7 (a).
    Page 7, “Results”
  3. Overall, there is no major tradeoff between precision and recall across the different settings; although we can observe the following: (i) adding more training data helps precision more than recall (over three times more) — compare the last two columns in Table 7 (a); and (ii) the best setting has a slightly lower precision than all features, although a much better recall — compare columns 4 and 5 in Table 7 (a).
    Page 7, “Results”
  4. These basic exploratory experiments show that there is a lot of value in pursuing combinations of systems, if not for overall improvement, then at least to benefit from tradeoffs in precision and recall that may be appropriate for different applications.
    Page 8, “Results”


SVM

Appears in 4 sentences as: SVM (4)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. The PZD system relies on a set of SVM classifiers trained using morphological and lexical features (a stand-in training sketch follows this list).
    Page 4, “Experimental Settings”
  2. The SVM classifiers are built using Yamcha (Kudo and Matsumoto, 2003).
    Page 4, “Experimental Settings”
  3. Simple features are used directly by the PZD SVM models, whereas Binned features’ (numerical) values are reduced to a small, labeled category set whose labels are used as model features.
    Page 4, “Experimental Settings”
  4. The bin labels are used as the SVM features.
    Page 5, “Experimental Settings”
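
Items 1-4 above describe SVM classifiers built with Yamcha over Simple and Binned features. The sketch below is a rough stand-in using scikit-learn rather than Yamcha, with invented toy data; it only illustrates how string-valued features and bin labels can feed a linear SVM, not the paper's actual configuration.

# Stand-in sketch (not the paper's Yamcha setup): a linear SVM over per-word
# feature dictionaries, where Simple features are strings and Binned features
# contribute their bin labels. The training data here is a tiny invented set.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

train_feats = [
    {"word": "كتاب", "pos": "noun", "conf_bin": "BIN_9"},
    {"word": "في",   "pos": "prep", "conf_bin": "BIN_EXACT_1"},
    {"word": "قلعة", "pos": "noun", "conf_bin": "BIN_2"},
    {"word": "ذهب",  "pos": "verb", "conf_bin": "BIN_1"},
]
train_labels = [0, 0, 1, 1]  # 1 = problem word, 0 = correct word

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_feats)
clf = LinearSVC()
clf.fit(X, train_labels)

test = vectorizer.transform([{"word": "قلعة", "pos": "noun", "conf_bin": "BIN_2"}])
print(clf.predict(test))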


part-of-speech

Appears in 3 sentences as: part-of-speech (3)
In Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
  1. In this paper we consider the value of morpho-lexical and morpho-syntactic features such as lemmas and part-of-speech tags, respectively, that may allow machine learning algorithms to learn generalizations.
    Page 2, “Arabic Handwriting Recognition Challenges”
  2. … Ya and digit normalization
     pos   The part-of-speech (POS) of the word
     lem   The lemma of the word
    Page 4, “Experimental Settings”
  3. part-of-speech tags, which they do not use, but suggest may help.
    Page 9, “Related Work”
