Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition
Tomeh, Nadi and Habash, Nizar and Roth, Ryan and Farra, Noura and Dasigi, Pradeep and Diab, Mona
Article Structure
Abstract
Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency.
Introduction
Optical Character Recognition (OCR) is the task of converting scanned images of handwritten, typewritten or printed text into machine-encoded text.
Discriminative Reranking for OCR
Each hypothesis in an n-best list {h_i} (i = 1, …, n) is represented by a d-dimensional feature vector x_i ∈ ℝ^d.
Experiments
3.1 Data and baselines
Conclusion
We presented a set of experiments on incorporating features into an existing OCR system via n-best list reranking.
Topics
reranking
Appears in 14 sentences as: rerank (1) reranked (1) Reranking (1) reranking (12)
- To do so we follow an n-best list reranking approach that exploits recent advances in learning to rank techniques.
Page 1, “Abstract”
- A straightforward alternative which we advocate in this paper is to use the available information to rerank the hypotheses in the n-best lists.
Page 1, “Introduction”
- Discriminative reranking allows each hypothesis to be represented as an arbitrary set of features without the need to explicitly model their interactions.
Page 1, “Introduction”
- We describe our features and reranking approach in §2, and we present our experiments and results in §3.
Page 1, “Introduction”
- 2.2 Ensemble reranking
Page 2, “Discriminative Reranking for OCR”
- In addition to the approaches mentioned above, we couple simple feature selection with reranking-model combination via a straightforward ensemble learning method similar to stacked generalization (Wolpert, 1992) and Combiner (Chan and Stolfo, 1993); a sketch of this ensemble follows at the end of this list.
Page 2, “Discriminative Reranking for OCR”
- These features are used by the baseline system (see footnote 5) as well as by the various reranking methods.
Page 3, “Discriminative Reranking for OCR”
- Table 2: WER for baseline, oracle and best reranked hypotheses.
Page 4, “Discriminative Reranking for OCR”
- Table 2 presents the WER for our baseline hypothesis, the best hypothesis in the list (our oracle), and our best reranking results, which we describe in detail in §3.2.
Page 4, “Experiments”
- on the reranking performance for one of our best reranking models, namely RankSVM (a pairwise RankSVM sketch also follows this list).
Page 4, “Experiments”
- 3.2 Reranking results
Page 4, “Experiments”
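The stacking-style ensemble above is concrete enough to sketch. Below is a minimal illustration in Python: pointwise base scorers are trained on held-out folds, their held-out scores become meta-features, and a simple least-squares "Combiner" weights them. The base learners, fold scheme, and combiner choice are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def stack_scores(lists, labels, base_models, n_folds=5):
    """lists: one (n_i, d) feature matrix per n-best list.
    labels: one (n_i,) array per list, e.g. negative WER (higher = better).
    Returns, per list, a matrix of held-out base-model scores (meta-features)."""
    meta = [np.zeros((X.shape[0], len(base_models))) for X in lists]
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, held_idx in kf.split(lists):
        X_tr = np.vstack([lists[i] for i in train_idx])
        y_tr = np.concatenate([labels[i] for i in train_idx])
        for m, model in enumerate(base_models):
            model.fit(X_tr, y_tr)                        # pointwise surrogate for a base reranker
            for i in held_idx:
                meta[i][:, m] = model.predict(lists[i])  # score held-out lists only
    return meta

def train_combiner(meta, labels):
    """Level-1 'Combiner': least squares over the stacked base scores."""
    return LinearRegression().fit(np.vstack(meta), np.concatenate(labels)).coef_

def rerank(meta, w):
    """Pick, in each list, the hypothesis with the best combined score."""
    return [int(np.argmax(m @ w)) for m in meta]
```

In a full pipeline the base models would be refit on all training lists before scoring test data; the fold loop exists only so the combiner never sees scores a base model produced on its own training examples.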
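RankSVM, cited above as one of the best reranking models, is commonly realized by reducing ranking to classification of pairwise feature differences within each n-best list. The sketch below assumes a linear kernel and scikit-learn's LinearSVC; the solver and C value are illustrative, not the paper's setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(lists, wers):
    """Within each n-best list, a lower-WER hypothesis should outscore a
    higher-WER one, so classify signed feature differences."""
    X_pairs, y_pairs = [], []
    for X, w in zip(lists, wers):                # one (n_i, d) matrix per list
        for i in range(len(w)):
            for j in range(len(w)):
                if w[i] < w[j]:                  # i is the better hypothesis
                    X_pairs.append(X[i] - X[j])
                    y_pairs.append(1.0)
                    X_pairs.append(X[j] - X[i])  # balanced negative pair
                    y_pairs.append(-1.0)
    return np.array(X_pairs), np.array(y_pairs)

def train_ranksvm(lists, wers, C=1.0):
    Xp, yp = pairwise_transform(lists, wers)
    svm = LinearSVC(C=C, fit_intercept=False)    # intercept cancels on differences
    svm.fit(Xp, yp)
    return svm.coef_.ravel()                     # scores a hypothesis as w @ x

def best_hypothesis(w, X):
    return int(np.argmax(X @ w))                 # rerank one n-best list
```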
LM
Appears in 9 sentences as: LM (10)
- The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model ( LM ) to emphasize the fluency of the output.
Page 1, “Introduction”
- For an input image, the OCR decoder generates an n-best list of hypotheses, each of which is associated with HMM and LM scores.
Page 1, “Introduction”
- Base features include the HMM and LM scores produced by the OCR system.
Page 3, “Discriminative Reranking for OCR”
- Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, …} (see the scoring sketch after this list).
Page 3, “Discriminative Reranking for OCR”
- The LM models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Page 3, “Discriminative Reranking for OCR”
- Linguistic LM features (“LM-MADA”) are similar to the word LM features except that they are computed using the part-of-speech and the lemma of the words instead of the actual words.
Page 3, “Discriminative Reranking for OCR”
- Footnote 5: The baseline ranking is simply based on the sum of the logs of the HMM and LM scores.
Page 3, “Discriminative Reranking for OCR”
- Our baseline is based on the sum of the logs of the HMM and LM scores.
Page 4, “Experiments”
- For LM training we used 220M words from Arabic Gigaword 3, and 2.4M words from each of the “print” and “hand” ground-truth annotations.
Page 4, “Experiments”
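Two mechanics recur in these excerpts: the baseline score, which is just log(HMM score) + log(LM score), and the "LM-word" features, one hypothesis log-probability per n-gram order. The add-one-smoothed toy LM below is a stand-in for the SRILM-trained models and only shows how such features would be assembled; the training sentences and orders in the usage lines are made up.

```python
import math
from collections import Counter

class ToyNgramLM:
    """Add-one-smoothed n-gram LM; an illustrative stand-in for SRILM models."""
    def __init__(self, corpus_sents, n):
        self.n = n
        self.grams, self.hists, self.vocab = Counter(), Counter(), set()
        for sent in corpus_sents:
            toks = ["<s>"] * (n - 1) + sent + ["</s>"]
            self.vocab.update(toks)
            for i in range(n - 1, len(toks)):
                gram = tuple(toks[i - n + 1:i + 1])
                self.grams[gram] += 1
                self.hists[gram[:-1]] += 1

    def logprob(self, sent):
        toks = ["<s>"] * (self.n - 1) + sent + ["</s>"]
        V = len(self.vocab)
        lp = 0.0
        for i in range(self.n - 1, len(toks)):
            gram = tuple(toks[i - self.n + 1:i + 1])
            # add-one smoothing; real systems use Kneser-Ney or similar
            lp += math.log((self.grams[gram] + 1) / (self.hists[gram[:-1]] + V))
        return lp

def baseline_score(log_hmm, log_lm):
    return log_hmm + log_lm        # footnote 5: sum of the two log scores

def lm_word_features(sent, lms):
    return [lm.logprob(sent) for lm in lms]   # one feature per n-gram order

if __name__ == "__main__":
    train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
    lms = [ToyNgramLM(train, n) for n in (1, 2, 3)]
    print(lm_word_features(["the", "cat", "sat"], lms))
    print(baseline_score(-42.0, lms[1].logprob(["the", "cat", "sat"])))
```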
error rate
Appears in 4 sentences as: Error Rate (1) error rate (3)
- We achieve 10.1% and 11.4% reduction in recognition word error rate (WER) relative to a standard baseline system on typewritten and handwritten Arabic respectively.
Page 1, “Abstract”
- The loss is computed as the Word Error Rate (WER) of the hypothesis.
Page 1, “Discriminative Reranking for OCR”
- During training, the weights are updated according to the Margin-Infused Relaxed Algorithm (MIRA) whenever the highest-scoring hypothesis differs from the hypothesis with the lowest error rate (a sketch of the update follows this list).
Page 2, “Discriminative Reranking for OCR”
- We note that a small number of hypotheses per list is sufficient for RankSVM to obtain good performance, but increasing n further seems to increase the error rate.
Page 4, “Experiments”
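The excerpts fix the loss as WER and state when MIRA updates fire; the standard single-constraint, clipped MIRA step sketched below fills in the rest. The update formula and the clip constant C are textbook MIRA, assumed here rather than taken from the paper.

```python
import numpy as np

def wer(hyp, ref):
    """Word error rate of token list hyp against reference ref (Levenshtein)."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(ref), 1)

def mira_update(w, X, losses, C=0.01):
    """X: (n, d) features of one n-best list; losses: per-hypothesis WER."""
    pred = int(np.argmax(X @ w))           # current highest-scoring hypothesis
    oracle = int(np.argmin(losses))        # lowest-WER hypothesis in the list
    if pred == oracle:
        return w                           # no violation, no update
    delta = X[oracle] - X[pred]
    margin = w @ delta                     # how far oracle already outscores pred
    loss = losses[pred] - losses[oracle]   # WER gap the model failed to respect
    tau = min(C, max(0.0, (loss - margin) / (delta @ delta + 1e-12)))
    return w + tau * delta
```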
Language Modeling
Appears in 3 sentences as: language model (1) Language Modeling (1) language models (1)
- Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency.
Page 1, “Abstract”
- The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
Page 1, “Introduction”
- The LM models are built using the SRI Language Modeling Toolkit (Stolcke, 2002).
Page 3, “Discriminative Reranking for OCR”
n-gram
Appears in 3 sentences as: n-gram (3)
- The BBN Byblos OCR system (Natarajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
Page 1, “Introduction”
- Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, …}.
Page 3, “Discriminative Reranking for OCR”
- Semantic coherence feature (“SemCoh”) is motivated by the fact that semantic information can be very useful in modeling the fluency of phrases, and can augment the information provided by n-gram LMs.
Page 3, “Discriminative Reranking for OCR”
semantic relatedness
Appears in 3 sentences as: semantic relatedness (3)
- To strike a balance between these two extremes, we introduce a novel model of semantic coherence that is based on a measure of semantic relatedness between pairs of words.
Page 3, “Discriminative Reranking for OCR”
- We model semantic relatedness between two words using the Information Content (IC) of the pair in a method similar to the one used by Lin (1997) and Lin (1998).
Page 3, “Discriminative Reranking for OCR”
- During testing, for each phrase in our test set, we measure the semantic relatedness of pairs of words using the IC values estimated from the Arabic Gigaword, and normalize their sum by the number of pairs in the phrase to obtain a measure of Semantic Coherence (SC) for the phrase (a sketch follows this list).
Page 3, “Discriminative Reranking for OCR”
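Reading the three excerpts together: relatedness is an Information Content (IC) score for a word pair, and phrase-level Semantic Coherence is the sum over pairs normalized by the pair count. The sketch below instantiates the pairwise score as PMI over co-occurrence counts, which is an assumption; the paper follows Lin's IC formulation with counts estimated from Arabic Gigaword.

```python
import math
from itertools import combinations

def pmi(w1, w2, pair_counts, word_counts, total):
    """Pointwise mutual information as a stand-in for pairwise IC."""
    joint = pair_counts.get(frozenset((w1, w2)), 0)  # unordered-pair counts
    if joint == 0:
        return 0.0            # unseen pair contributes nothing (toy choice)
    p_xy = joint / total
    p_x = word_counts[w1] / total
    p_y = word_counts[w2] / total
    return math.log(p_xy / (p_x * p_y))

def semantic_coherence(phrase, pair_counts, word_counts, total):
    """SC of a token list: sum of pairwise scores / number of pairs."""
    pairs = list(combinations(phrase, 2))
    if not pairs:
        return 0.0
    s = sum(pmi(a, b, pair_counts, word_counts, total) for a, b in pairs)
    return s / len(pairs)     # normalize by the number of pairs, as in §2
```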