Word Segmentation of Informal Arabic with Domain Adaptation
Monroe, Will and Green, Spence and Manning, Christopher D.

Article Structure

Abstract

Segmentation of clitics has been shown to improve accuracy on a variety of Arabic NLP tasks.

Introduction

Segmentation of words, clitics, and affixes is essential for a number of natural language processing (NLP) applications, including machine translation, parsing, and speech recognition (Chang et al., 2008; Tsarfaty, 2006; Kurimo et al., 2006).

Arabic Word Segmentation Model

A CRF model (Lafferty et al., 2001) defines a distribution p(Y|X; θ), where X = {x_1, .

Experiments

We train and evaluate on three corpora: parts 1-3 of the newswire Arabic Treebank (ATB), the Broadcast News Arabic Treebank (BN), and parts 1-8 of the BOLT Phase 1 Egyptian Arabic Treebank (ARZ). These correspond respectively to the domains in section 2.2.

Error Analysis

We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set; see Table 4.

Topics

domain adaptation

Appears in 8 sentences as: Domain adaptation (1) domain adaptation (7)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. We extend an existing MSA segmenter with a simple domain adaptation technique and new features in order to segment informal and dialectal Arabic text.
    Page 1, “Abstract”
  2. Third, we show that dialectal data can be handled in the framework of domain adaptation.
    Page 1, “Introduction”
  3. 2.2 Domain adaptation
    Page 2, “Arabic Word Segmentation Model”
  4. The approach to domain adaptation we use is that of feature space augmentation (Daumé, 2007).
    Page 3, “Arabic Word Segmentation Model”
  5. Using domain adaptation alone helps performance on two of the three datasets (with a statistically insignificant decrease on broadcast news), and that our additional features further improve
    Page 3, “Experiments”
  6. We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set; see Table 4.
    Page 4, “Error Analysis”
  7. In this paper we demonstrate substantial gains on Arabic clitic segmentation for both formal and dialectal text using a single model with dialect-independent features and a simple domain adaptation strategy.
    Page 5, “Error Analysis”
  8. However, as data for other Arabic dialects and genres becomes available, we expect that the model’s simplicity and the domain adaptation method we use will allow the system to be applied to these dialects with minimal effort and without a loss of performance in the original domains.
    Page 5, “Error Analysis”
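
The feature space augmentation cited in sentence 4 (Daumé, 2007) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `augment_features`, the `shared::` prefix, and the example feature strings are hypothetical. Only the domain labels (ATB, BN, ARZ) come from the paper. Each indicator feature is emitted twice, once in a copy shared by all domains and once in a copy specific to the example's domain, so a learner can fit general and domain-specific weights side by side.

```python
from typing import Dict, Iterable


def augment_features(features: Iterable[str], domain: str) -> Dict[str, float]:
    """Duplicate each indicator feature into a shared and a domain-specific copy.

    `features` holds the names of the indicator features that fire for one
    example; `domain` is the label of the corpus the example comes from
    (for instance "ATB", "BN", or "ARZ").
    """
    augmented: Dict[str, float] = {}
    for name in features:
        # Shared copy: its weight is learned from every domain.
        augmented[f"shared::{name}"] = 1.0
        # Domain copy: its weight is learned only from this domain's data.
        augmented[f"{domain}::{name}"] = 1.0
    return augmented


# Hypothetical feature names, used only for illustration.
example = augment_features(["char=w", "prev_char=b"], "ARZ")
```

Under this scheme, supporting a new dialect or genre means adding one more domain label rather than designing new features, which fits the low-effort extension the authors anticipate in sentence 8.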

See all papers in Proc. ACL 2014 that mention domain adaptation.

development set

Appears in 5 sentences as: development set (4) development sets (1)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. F1 scores provide a more informative assessment of performance than word-level or character-level accuracy scores, as over 80% of tokens in the development sets consist of only one segment, with an average of one segmentation every 4.7 tokens (or one every 20.4 characters).
    Page 3, “Experiments”
  2. Table 1 contains results on the development set for the model of Green and DeNero and our improvements.
    Page 3, “Experiments”
  3. We sampled 100 errors randomly from all errors made by our final model (trained on all three datasets with domain adaptation and additional features) on the ARZ development set; see Table 4.
    Page 4, “Error Analysis”
  4. Table 4: Counts of error categories (out of 100 randomly sampled ARZ development set errors).
    Page 5, “Error Analysis”
  5. One example of this distinction that appeared in the development set is the pair any)» mawdm“‘my topic” (yo madeZ< + 6.
    Page 5, “Error Analysis”
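
Sentence 1 argues that F1 is more informative than accuracy because most tokens contain a single segment. The sketch below shows one way a boundary-level F1 could be computed. The excerpt does not give the paper's exact definition, so the scoring unit used here (character offsets of segment boundaries inside each token) is an assumption, and the names `boundary_positions` and `boundary_f1` and the transliterated example tokens are hypothetical.

```python
from typing import List, Set


def boundary_positions(segments: List[str]) -> Set[int]:
    """Return the character offsets at which a token is split."""
    positions: Set[int] = set()
    offset = 0
    for segment in segments[:-1]:  # no boundary after the final segment
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_f1(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """F1 over segment boundaries, with tokens aligned one to one."""
    tp = fp = fn = 0
    for gold_segments, pred_segments in zip(gold, predicted):
        g = boundary_positions(gold_segments)
        p = boundary_positions(pred_segments)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0  # also covers the case with no boundaries at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two tokens: the first is segmented correctly, the second is over-split.
gold = [["w", "ktb"], ["ktAb"]]
pred = [["w", "ktb"], ["kt", "Ab"]]
score = boundary_f1(gold, pred)
```

Because only boundaries are counted, the many unsegmented tokens that a system trivially gets right do not inflate the score, whereas they would inflate word-level or character-level accuracy.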

feature set

Appears in 4 sentences as: feature set (4)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. This feature set also allows the model to take into account other interactions between the beginning and end of a word, particularly those involving the definite article ال al-.
    Page 2, “Arabic Word Segmentation Model”
  2. A notable property of this feature set is that it remains highly dialect-agnostic, even though our additional features were chosen in response to errors made on text in Egyptian dialect.
    Page 2, “Arabic Word Segmentation Model”
  3. • errors that can be fixed with a fuller analysis of just the problematic token, and therefore represent a deficiency in the feature set; and
    Page 4, “Error Analysis”
  4. In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token.
    Page 5, “Error Analysis”

segmentations

Appears in 4 sentences as: segmentations (3) segmenters (1)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect.
    Page 1, “Abstract”
  2. Some incorrect segmentations produced by the original system could be ruled out with the knowledge of these statistics.
    Page 2, “Arabic Word Segmentation Model”
  3. In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token.
    Page 5, “Error Analysis”
  4. 4.3 Context-sensitive segmentations and multiple word senses
    Page 5, “Error Analysis”

CRF

Appears in 3 sentences as: CRF (3)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. The model is an extension of the character-level conditional random field (CRF) model of Green and DeNero (2012).
    Page 1, “Introduction”
  2. A CRF model (Lafferty et al., 2001) defines a distribution p(Y|X; θ), where X = {x_1, .
    Page 2, “Arabic Word Segmentation Model”
  3. The model of Green and DeNero is a third-order (i.e., 4-gram) Markov CRF, employing the following indicator features:
    Page 2, “Arabic Word Segmentation Model”
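
For reference, the conditional distribution of a third-order linear-chain CRF is usually written in the form below. The quoted sentence is truncated in this excerpt, so the notation is an assumption and not the paper's own: f is a feature vector, n is the input length, and Z is the normalizer. None of these symbols appear in the excerpt.

```latex
p(Y \mid X; \theta) =
  \frac{1}{Z(X; \theta)}
  \exp\Big( \sum_{i=1}^{n} \theta^{\top} f(X, y_{i-3}, y_{i-2}, y_{i-1}, y_i, i) \Big),
\qquad
Z(X; \theta) =
  \sum_{Y'} \exp\Big( \sum_{i=1}^{n} \theta^{\top} f(X, y'_{i-3}, y'_{i-2}, y'_{i-1}, y'_i, i) \Big)
```

"Third-order" means each feature may inspect the current label together with the three preceding labels, which is why sentence 3 describes the model as a 4-gram model over labels.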

evaluation metrics

Appears in 3 sentences as: Evaluation metrics (1) evaluation metrics (2)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. 3.1 Evaluation metrics
    Page 3, “Experiments”
  2. We use two evaluation metrics in our experiments.
    Page 3, “Experiments”
  3. Our segmenter achieves higher scores than MADA and MADA-ARZ on all datasets under both evaluation metrics.
    Page 4, “Experiments”

Treebank

Appears in 3 sentences as: Treebank (3) treebank (3)
In Word Segmentation of Informal Arabic with Domain Adaptation
  1. We train and evaluate on three corpora: parts 1-3 of the newswire Arabic Treebank (ATB), the Broadcast News Arabic Treebank (BN), and parts 1-8 of the BOLT Phase 1 Egyptian Arabic Treebank (ARZ). These correspond respectively to the domains in section 2.2.
    Page 3, “Experiments”
  2. We classify 7 as typos and 26 as annotation inconsistencies, although the distinction between the two is murky: typos are intentionally preserved in the treebank data, but segmentation of typos varies depending on how well they can be reconciled with standard Arabic orthography.
    Page 4, “Error Analysis”
  3. The first example is segmented in the Egyptian treebank but is left unsegmented by our system; the second is left as a single token in the treebank but is split into the above three segments by our system.
    Page 4, “Error Analysis”
