Unsupervised Morphology-Based Vocabulary Expansion
Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash and Owen Rambow

Article Structure

Abstract

We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages.

Introduction

In many applications in human language technologies (HLT), the goal is to generate text in a target language, using its standard orthography.

Related Work

Approaches to Morphological Modeling. Computational morphology is a very active area of research, with a multitude of approaches that vary in the degree of manual annotation needed and the amount of machine learning used.

Morphology-based Vocabulary Expansion

3.1 Approach

Evaluation

4.1 Evaluation Data and Tools

Conclusion and Future Work

We have presented an approach to generating new words.

Topics

reranking

Appears in 18 sentences as: rerank (2) Reranked (1) reranked (1) Reranking (5) reranking (12) reranks (2)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. Reranking Models. Given that the size of the expanded vocabulary can be quite large and may include a lot of over-generation, we rerank the expanded set of words before taking the top n words to use in downstream processes.
    Page 3, “Morphology-based Vocabulary Expansion”
  2. We consider four reranking conditions which we describe below.
    Page 3, “Morphology-based Vocabulary Expansion”
  3. Reranked Expansion
    Page 3, “Morphology-based Vocabulary Expansion”
  4. 3.3 Reranking Techniques
    Page 4, “Morphology-based Vocabulary Expansion”
  5. We limit the number of vocabulary expansions using different thresholds after reranking or reweighting the generated WFSTs.
    Page 4, “Morphology-based Vocabulary Expansion”
  6. We consider four reranking conditions.
    Page 5, “Morphology-based Vocabulary Expansion”
  7. 3.3.1 No Reranking (NRR)
    Page 5, “Morphology-based Vocabulary Expansion”
  8. The baseline reranking option is no reranking (NRR).
    Page 5, “Morphology-based Vocabulary Expansion”
  9. 3.3.3 Trigraph-based Reranking (TRR)
    Page 5, “Morphology-based Vocabulary Expansion”
  10. Instead of reweighting the WFST, we get the n-best list of generated words and rerank them using their trigraph probabilities.
    Page 5, “Morphology-based Vocabulary Expansion”
  11. 3.3.4 Reranking Morpheme Boundaries (BRR)
    Page 5, “Morphology-based Vocabulary Expansion”
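
For concreteness, here is a minimal sketch in Python of the trigraph-based reranking (TRR) described in items 4 and 10 above. The boundary padding, the add-alpha smoothing, the assumed alphabet size, and the function names are all illustrative assumptions, not the authors' implementation:

    import math
    from collections import Counter

    def train_trigraph_model(training_words, alpha=0.1, alphabet_size=30):
        """Estimate letter-trigraph probabilities from the training
        vocabulary. Words are padded with boundary symbols so that
        word-initial and word-final trigraphs are counted too."""
        tri, bi = Counter(), Counter()
        for w in training_words:
            padded = "##" + w + "#"
            for i in range(len(padded) - 2):
                tri[padded[i:i + 3]] += 1
                bi[padded[i:i + 2]] += 1

        def logprob(word):
            # Sum of smoothed log P(letter | previous two letters).
            padded = "##" + word + "#"
            return sum(math.log((tri[padded[i:i + 3]] + alpha) /
                                (bi[padded[i:i + 2]] + alpha * alphabet_size))
                       for i in range(len(padded) - 2))

        return logprob

    def rerank_trr(nbest_words, logprob, n):
        """TRR: sort the n-best list of generated words by trigraph
        log-probability and keep the top n for downstream use."""
        return sorted(nbest_words, key=logprob, reverse=True)[:n]

One caveat of this sketch: unnormalized trigraph log-probabilities penalize longer words, so a length-normalized variant of the score may be preferable in practice.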

Bigram

Appears in 8 sentences as: Bigram (8) bigram (1)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. We use two different models of morphology expansion in this paper: the Fixed Affix model and the Bigram Affix model.
    Page 3, “Morphology-based Vocabulary Expansion”
  2. 3.2.2 Bigram Affix Expansion Model
    Page 4, “Morphology-based Vocabulary Expansion”
  3. In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine.
    Page 4, “Morphology-based Vocabulary Expansion”
  4. However, this word can be generated in the Bigram Affix model as shown in Figure 2c: there is a path passing 0 → 4 → 1 → 2 → 5 → 6 → 3 in the FST that can produce this word.
    Page 4, “Morphology-based Vocabulary Expansion”
  5. We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr).
    Page 5, “Morphology-based Vocabulary Expansion”
  6. The last reranking technique reranks the n-best generated word list with trigraphs that are incident on the morpheme boundaries (in the case of the Bigram Affix model, the last prefix and the first suffix).
    Page 5, “Morphology-based Vocabulary Expansion”
  7. To our surprise, the Fixed Affix model does a slightly better job of reducing the out-of-vocabulary rate than the Bigram Affix model.
    Page 6, “Evaluation”
  8. WoTr 24.21 Bigram Affix Model TRR 25.
    Page 9, “Evaluation”
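
To make the path intuition in item 4 concrete, the following sketch spells out the generation test that the Bigram Affix model's FST encodes: a word is generable if it splits into a chain of prefixes, a stem, and a chain of suffixes such that every adjacent affix pair (including transitions from the start sentinel and to the end sentinel) was observed in training. The paper builds this as a finite-state machine; the explicit recursion, the sentinel symbols, and all names below are assumptions for illustration:

    START, END = "<s>", "</s>"

    def can_generate(word, prefixes, stems, suffixes,
                     prefix_bigrams, suffix_bigrams):
        """True if word = prefix* + stem + suffix* where every adjacent
        affix pair appears in the observed bigram sets. Affixes are
        assumed to be non-empty strings; the bigram sets are assumed to
        contain (START, x) and (x, END) pairs for chain-initial and
        chain-final affixes seen in training."""

        def prefix_ends(pos, prev):
            # Positions where the prefix chain may legally stop.
            if prev == START or (prev, END) in prefix_bigrams:
                yield pos
            for p in prefixes:
                if word.startswith(p, pos) and (prev, p) in prefix_bigrams:
                    yield from prefix_ends(pos + len(p), p)

        def suffix_chain_ok(pos, prev):
            # Can word[pos:] be consumed by a legal suffix chain?
            if pos == len(word):
                return prev == START or (prev, END) in suffix_bigrams
            return any(word.startswith(s, pos)
                       and (prev, s) in suffix_bigrams
                       and suffix_chain_ok(pos + len(s), s)
                       for s in suffixes)

        return any(word.startswith(stem, i)
                   and suffix_chain_ok(i + len(stem), START)
                   for i in prefix_ends(0, START)
                   for stem in stems)

By contrast, the Fixed Affix model treats each observed prefix string and suffix string as a single unit, which is why a word like the one in item 4, whose affix combination is novel, may be reachable only in the Bigram model.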

morphological analyzer

Appears in 8 sentences as: morphological analyzer (7) morphological analyzers (1)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. For low-resource languages, resources such as morphological analyzers are not usually available, and even good scholarly descriptions of the morphology (from which a tool could be built) are often not available.
    Page 1, “Introduction”
  2. Word Generation Tools and Settings. For unsupervised learning of morphology, we use Morfessor CAT-MAP (V. 0.9.2), which was shown to be a very accurate morphological analyzer for morphologically rich languages (Creutz and Lagus, 2007).
    Page 5, “Evaluation”
  3. …and thus we also have a morphological analyzer that can give all possible segmentations for a given word.
    Page 6, “Evaluation”
  4. By running the morphological analyzer on the OOVs, we can obtain the potential upper bound of OOV reduction by the system (labeled “oo” in Tables 2 and 3).
    Page 6, “Evaluation”
  5. Error Analysis on Turkish. Unfortunately, for most languages we could not find an available rule-based or supervised morphological analyzer to verify the words generated by our model.
    Page 8, “Evaluation”
  6. The only tool available to us is a Turkish finite-state morphological analyzer (Oflazer, 1996) implemented with the Xerox FST toolkit (Beesley and Karttunen, 2003).
    Page 8, “Evaluation”
  7. Another observation is that the recognition percentage of the morphological analyzer on INV words is much higher than on OOVs, which shows that OOVs in the Turkish dataset are much harder to analyze.
    Page 8, “Evaluation”
  8. Table 5: Results from running a handcrafted Turkish morphological analyzer (Oflazer, 1996) on different expansions and on the development set.
    Page 9, “Evaluation”
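
Items 5 through 8 verify generated words against a hand-crafted analyzer. Here is a minimal sketch of the recognition-rate computation behind Table 5; `analyze` stands for any analyzer wrapper (for example, a lookup against the Xerox FST) and is an assumed interface, not an API of those tools:

    def recognition_rate(words, analyze):
        """Fraction of word types the analyzer accepts, where
        analyze(word) returns a (possibly empty) list of analyses."""
        if not words:
            return 0.0
        recognized = sum(1 for w in words if analyze(w))
        return recognized / len(words)

    # Hypothetical usage: compare coverage on in-vocabulary (INV)
    # words, OOVs, and the words the expansion model generates.
    # print(recognition_rate(inv_words, xfst_lookup))
    # print(recognition_rate(oov_words, xfst_lookup))
    # print(recognition_rate(generated_words, xfst_lookup))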

language model

Appears in 5 sentences as: language model (3) language modeling (1) language models (1)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. The best-performing systems for these applications today rely on training on large amounts of data: in the case of ASR, the data is aligned audio and transcription, plus large unannotated data for the language modeling; in the case of OCR, it is transcribed optical data; in the case of MT, it is aligned bitexts.
    Page 1, “Introduction”
  2. For ASR and OCR, which can compose words from smaller units (phones or graphically recognized letters), an expanded target language vocabulary can be directly exploited without the need for changing the technology at all: the new words need to be inserted into the relevant resources (lexicon, language model, etc.) with appropriately estimated probabilities.
    Page 2, “Introduction”
  3. The expanded word combinations can be used to extend the language models used for MT to bias against incoherent hypothesized new sequences of segmented words.
    Page 2, “Introduction”
  4. In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine.
    Page 4, “Morphology-based Vocabulary Expansion”
  5. We reweight the weights in the WFST model (Fixed or Bigram) by composing it with a letter trigraph language model (WoTr).
    Page 5, “Morphology-based Vocabulary Expansion”
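
Items 4 and 5 use a letter trigraph model in two different ways: TRR reranks an already-extracted n-best list, while WoTr composes the trigraph model with the expansion WFST itself. In the tropical semiring, composition adds each path's cost to the word's trigraph cost, which the following sketch imitates on an already-enumerated candidate set; the interpolation weight lam and all names are assumptions for illustration:

    def reweight_wotr(candidate_costs, trigraph_logprob, lam=1.0):
        """Conceptual analogue of composing the expansion WFST (Fixed
        or Bigram) with a letter trigraph LM: each candidate's cost
        becomes its model cost plus its negated trigraph
        log-probability, as path costs add under composition in the
        tropical semiring."""
        return {word: cost + lam * -trigraph_logprob(word)
                for word, cost in candidate_costs.items()}

The practical difference is ordering: WoTr changes the weights before the n-best list is extracted, so it can surface words that the unweighted machine would have ranked below the cutoff, whereas list reranking can only reorder words that were already extracted.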

best result

Appears in 3 sentences as: best result (2) best results (1)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary words with a small amount of unlabeled training data.
    Page 1, “Abstract”
  2. In our best result (on Assamese), we show that our approach can predict 29% of the token-based out-of-vocabulary words with a small amount of unlabeled training data.
    Page 2, “Introduction”
  3. The best results (again, except for Pashto) are achieved using one of the three reranking methods (reranking by trigraph probabilities or morpheme boundaries) as opposed to doing no reranking.
    Page 6, “Evaluation”

development set

Appears in 3 sentences as: development set (3)
In Unsupervised Morphology-Based Vocabulary Expansion
  1. We sampled data from the training and development sets of the Persian dependency treebank (Rasooli et al., 2013) to create a comparable seventh dataset in Persian.
    Page 5, “Evaluation”
  2. oo is the upper-bound OOV reduction for our expansion model: for each word in the development set, we ask if our model, without any vocabulary size restriction at all, could generate it.
    Page 6, “Evaluation”
  3. Table 5: Results from running a handcrafted Turkish morphological analyzer (Oflazer, 1996) on different expansions and on the development set.
    Page 9, “Evaluation”
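
Item 2 defines the upper bound operationally; here is a minimal sketch, assuming the bound is computed over tokens and that can_generate is an unrestricted-generation test such as the one sketched under the Bigram topic above:

    def oov_upper_bound(dev_tokens, train_vocab, can_generate):
        """Token-based upper bound on OOV reduction (the 'oo' figure):
        of the dev tokens unseen in training, the fraction the
        expansion model could generate with no vocabulary-size
        restriction at all."""
        oov = [t for t in dev_tokens if t not in train_vocab]
        if not oov:
            return 0.0
        return sum(1 for t in oov if can_generate(t)) / len(oov)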
