Sentence Level Dialect Identification in Arabic
Elfardy, Heba and Diab, Mona

Article Structure

Abstract

This paper introduces a supervised approach for performing sentence level dialect identification between Modern Standard Arabic and Egyptian Dialectal Arabic.

Introduction

The Arabic language exists in a state of diglossia (Ferguson, 1959), where the standard form of the language, Modern Standard Arabic (MSA), and the regional dialects (DA) live side by side and are closely related.

Related Work

Dialect Identification in Arabic is crucial for almost all NLP tasks, yet most of the research in Arabic NLP, with few exceptions, is targeted towards MSA.

Approach to Sentence-Level Dialect Identification

We present a supervised system that uses a Naive Bayes classifier trained on gold labeled data with sentence level binary decisions of either being MSA or DA.
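
As an illustration of sentence-level binary classification with Naive Bayes, here is a minimal sketch. The source only says that a Naive Bayes classifier is trained on gold-labeled sentences; the Gaussian variant, the class name `GaussianNaiveBayes`, the two toy features (percentage of MSA-looking and DA-looking words), and all numbers are assumptions made for this example.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Tiny Gaussian Naive Bayes over numeric feature vectors (illustrative only)."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        total = len(rows)
        self.stats = {}
        for label, members in grouped.items():
            columns = list(zip(*members))
            means = [sum(col) / len(col) for col in columns]
            # A small constant keeps every variance strictly positive.
            variances = [
                sum((x - m) ** 2 for x in col) / len(col) + 1e-6
                for col, m in zip(columns, means)
            ]
            log_prior = math.log(len(members) / total)
            self.stats[label] = (log_prior, means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                for x, m, v in zip(row, means, variances)
            )
            return log_prior + log_likelihood

        # Pick the label with the highest (unnormalized) log posterior.
        return max(self.stats, key=log_posterior)


# Hypothetical training sentences: [pct_msa_like_words, pct_da_like_words].
train_rows = [[90, 10], [80, 20], [85, 5], [20, 80], [10, 90], [15, 70]]
train_labels = ["MSA", "MSA", "MSA", "DA", "DA", "DA"]

model = GaussianNaiveBayes().fit(train_rows, train_labels)
prediction = model.predict([88, 12])
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.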

Experiments

4.1 Data

Conclusion

We presented a supervised approach for sentence level dialect identification in Arabic.

Topics

LM

Appears in 15 sentences as: LM (16)
In Sentence Level Dialect Identification in Arabic
  1. Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem.
    Page 2, “Related Work”
  2. The aforementioned approach relies on language models (LM) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  3. The following variants of the underlying token-level system are built to assess the effect of varying the level of preprocessing on the underlying LM on the performance of the overall sentence level dialect identification process: (1) Surface, (2) Tokenized, (3) CODAfied, and (4) Tokenized-CODA.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  4. Orthography Normalized (CODAfied) LM: since DA is not originally a written form of Arabic, no standard orthography exists for it.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  5. We use the implementation of CODA presented in CODAfy (Eskander et al., 2013) to build an orthography-normalized LM.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  6. On the other hand, CODA solves the sparseness issue by mapping multiple spelling variants to the same orthographic form, leading to a more robust LM.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  7. Tokenized LM: the D3 tokenization scheme is applied to all data using MADA (Habash et al., 2009), an MSA tokenizer, for the MSA corpora, and MADA-ARZ (Habash et al., 2013), an EDA tokenizer, for the EDA corpora.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  8. For building the tokenized LM, we maintain clitics and lexemes.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  9. Some clitics are unique to MSA while others are unique to EDA so maintaining them in the LM is help-
    Page 2, “Approach to Sentence-Level Dialect Identification”
  10. the negation enclitic ش ($) is only used in EDA, but it could be seen with an MSA/EDA homograph; maintaining the enclitic in the LM facilitates the identification
    Page 2, “Approach to Sentence-Level Dialect Identification”
  11. The perplexity conveys how confused the LM is about the given sentence, so the higher the perplexity value, the less probable that the given sentence matches the LM.
    Page 3, “Approach to Sentence-Level Dialect Identification”
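
The perplexity feature mentioned in the last item can be sketched as follows. This is a minimal example that assumes per-token conditional probabilities are already available from some language model; the function name `sentence_perplexity` and the probability values are hypothetical and do not come from the paper.

```python
import math


def sentence_perplexity(token_probs):
    """Return 2 ** (-(1/N) * sum(log2 p_i)) for per-token probabilities p_i."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    n = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / n
    return 2 ** (-avg_log2)


# Hypothetical probabilities assigned to the same sentence by two LMs.
probs_under_msa_lm = [0.25, 0.25, 0.25, 0.25]
probs_under_eda_lm = [0.5, 0.5, 0.5, 0.5]

ppl_msa = sentence_perplexity(probs_under_msa_lm)
ppl_eda = sentence_perplexity(probs_under_eda_lm)

# The LM with the lower perplexity is the better match for the sentence.
better_match = "EDA" if ppl_eda < ppl_msa else "MSA"
```

In this toy case the second model assigns higher probabilities to every token, so its perplexity is lower and the sentence is judged a better match for that model.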

See all papers in Proc. ACL 2013 that mention LM.

cross validation

Appears in 3 sentences as: cross validation (3)
In Sentence Level Dialect Identification in Arabic
  1. The presented system outperforms the approach presented by Zaidan and Callison-Burch (2011) on the same dataset using 10-fold cross validation.
    Page 1, “Introduction”
  2. For both sets of experiments, we apply 10-fold cross validation on the training data.
    Page 3, “Approach to Sentence-Level Dialect Identification”
  3. Table 2: Performance Accuracies of the different configurations of the 8M LM (best-performing LM size) using 10-fold cross validation against the different baselines.
    Page 4, “Experiments”
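
To make the 10-fold protocol concrete, here is a minimal sketch of how the fold indices could be produced. The helper name `k_fold_splits` is hypothetical, and the sketch splits the data in its given order; in practice the items would usually be shuffled first, which the source does not describe either way.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each item is used for testing exactly once and for training k - 1 times.
splits = list(k_fold_splits(25, k=10))
```

The reported accuracy would then be the average over the ten held-out folds.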

language model

Appears in 3 sentences as: language model (1) language modeling (1) language models (1)
In Sentence Level Dialect Identification in Arabic
  1. Amazon Mechanical Turk and try a language modeling (LM) approach to solve the problem.
    Page 2, “Related Work”
  2. The aforementioned approach relies on language models (LM) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  3. The perplexity of a language model on a given test sentence S(w1, ..., wn) is defined as:
    Page 3, “Approach to Sentence-Level Dialect Identification”
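
The formula itself did not survive in this excerpt. The standard definition of sentence perplexity, assumed here rather than quoted from the paper, uses n for the number of tokens in S and h_i for the history (preceding context) of token w_i:

```latex
\mathrm{PP}(S) = 2^{-\frac{1}{n} \sum_{i=1}^{n} \log_2 p(w_i \mid h_i)}
```

A lower value means the language model assigns the sentence a higher average per-token probability.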

morphological analyzer

Appears in 3 sentences as: Morphological Analyzer (1) morphological analyzer (2)
In Sentence Level Dialect Identification in Arabic
  1. The aforementioned approach relies on language models (LM) and MSA and EDA Morphological Analyzer to decide whether each word is (a) MSA, (b) EDA, (c) Both (MSA & EDA) or (d) OOV.
    Page 2, “Approach to Sentence-Level Dialect Identification”
  2. Percentage of words in the sentence that are analyzable by an MSA morphological analyzer.
    Page 3, “Approach to Sentence-Level Dialect Identification”
  3. Percentage of words in the sentence that are analyzable by an EDA morphological analyzer.
    Page 3, “Approach to Sentence-Level Dialect Identification”
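
The two percentage features above can be sketched as one small helper. This is only an illustration: the function `analyzable_percentage`, the set-based stand-in analyzers, and the transliterated placeholder tokens are all hypothetical, and a real system would call an actual MSA or EDA morphological analyzer instead of a lookup.

```python
def analyzable_percentage(tokens, can_analyze):
    """Return the percentage of tokens for which can_analyze(token) is true."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if can_analyze(token))
    return 100.0 * hits / len(tokens)


# Stand-in "analyzers": plain lookups in toy lexicons (assumption, not a real API).
toy_msa_lexicon = {"ktb", "qlm", "byt"}
toy_eda_lexicon = {"byt", "Ezy"}

sentence = ["ktb", "qlm", "byt", "Ezy"]

pct_msa = analyzable_percentage(sentence, toy_msa_lexicon.__contains__)
pct_eda = analyzable_percentage(sentence, toy_eda_lexicon.__contains__)
```

Passing the analyzer in as a callable keeps the feature computation identical for both varieties; only the analyzer changes.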
