Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
Spiegler, Sebastian and Flach, Peter A.

Article Structure

Abstract

This paper demonstrates that the use of ensemble methods and careful calibration of the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition.

Introduction

Words are often considered the smallest units of a language when examining the grammatical structure or the meaning of sentences, referred to as syntax and semantics; however, words themselves possess an internal structure, known as word morphology.

Probabilistic generative model

Intuitively, we could say that our models describe the process of word generation from left to right by alternately using two dice: the first decides whether to place a morpheme boundary at the current word position, and the second yields the corresponding letter transition.
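
To make the two-dice picture concrete, here is a minimal, purely illustrative Python sketch of such a generative process. The boundary probability, alphabet and transition tables are invented for illustration and are not the paper's learned parameters; PROMODES additionally conditions the boundary probability on word length, and PROMODES-H on the preceding boundary (see Section 2 of the paper).

```python
import random

# Illustrative parameters (not the paper's learned values): a boundary
# probability and letter distributions conditioned on whether a boundary
# was just placed (b=1) or not (b=0).
P_BOUNDARY = 0.3
TRANSITIONS = {
    0: {"a": 0.4, "b": 0.3, "c": 0.3},    # within-morpheme letter distribution
    1: {"a": 0.5, "b": 0.25, "c": 0.25},  # distribution right after a boundary
}

def sample_letter(dist):
    r, acc = random.random(), 0.0
    for letter, p in dist.items():
        acc += p
        if r < acc:
            return letter
    return letter  # fall back to the last letter on floating-point edge cases

def generate_word(length):
    """Alternately roll the 'boundary die' and the 'letter die' at each position."""
    letters, boundaries = [], []
    for _ in range(length):
        b = 1 if random.random() < P_BOUNDARY else 0   # die 1: morpheme boundary?
        boundaries.append(b)
        letters.append(sample_letter(TRANSITIONS[b]))  # die 2: letter transition
    return "".join(letters), boundaries

print(generate_word(6))
```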

Experiments and Results

In the Morpho Challenge 2009, PROMODES achieved competitive results on Finnish, Turkish, English and German — and scored highest on non-vowelized and vowelized Arabic compared to 9 other algorithms (Kurimo et al., 2009).

Related work

We have presented two probabilistic generative models for word decomposition, PROMODES and PROMODES-H. Another generative model for morphological analysis has been described by Snover and Brent (2001) and Snover et al. (2002).

Conclusions

We have presented a method to learn a calibrated decision threshold from a validation set and demonstrated that ensemble methods in connection with calibrated decision thresholds can give better results than the individual models themselves.
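
The extract above does not spell out how the ensemble combines its members, so the following is only a generic sketch of combining two models' per-position boundary probabilities and applying a calibrated decision threshold; the averaging rule is an assumption, and the default threshold merely echoes the h* = 0.38 reported in the experiments.

```python
def ensemble_segment(p_promodes, p_promodes_h, threshold=0.38):
    """Combine two models' per-position boundary probabilities and apply a
    calibrated decision threshold. Averaging is just one possible rule."""
    assert len(p_promodes) == len(p_promodes_h)
    combined = [(p1 + p2) / 2 for p1, p2 in zip(p_promodes, p_promodes_h)]
    return [1 if p >= threshold else 0 for p in combined]

# Hypothetical per-position boundary probabilities for one word:
print(ensemble_segment([0.1, 0.7, 0.2, 0.9], [0.2, 0.6, 0.4, 0.8]))
```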

Topics

f-measure

Appears in 11 sentences as: F-measure (2) f-measure (9)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. F-measure
    Page 4, “Experiments and Results”
  2. F-measure
    Page 4, “Experiments and Results”
  3. Since both algorithms show different behaviour with increasing experience and PROMODES-H yields a higher f-measure across all datasets, we will investigate in the next experiments how these differences manifest themselves at the boundary level.
    Page 4, “Experiments and Results”
  4. We want to point out that the threshold which yields the best f-measure result on the validation set returns almost the same result on the separate test set for both algorithms, which suggests the existence of a general optimal threshold (a minimal threshold-sweep sketch follows this topic list).
    Page 5, “Experiments and Results”
  5. In addition to the PR curves, we plotted isometrics for corresponding f-measure values, which are defined by precision = (f · recall) / (2 · recall − f) and are hyperbolas.
    Page 5, “Experiments and Results”
  6. For increasing f-measure values the isometrics are moving further to the top-right corner of the plot.
    Page 5, “Experiments and Results”
  7. PROMODES has its optimal threshold at h* = 0.36 and PROMODES-H at h* = 0.37 where PROMODES has a slightly higher f-measure than PROMODES-H.
    Page 5, “Experiments and Results”
  8. The points of optimal f-measure performance are marked with ‘A’ on the PR curve.
    Page 5, “Experiments and Results”
  9. The calibrated threshold improves the f-measure over both PROMODES and PROMODES-H.
    Page 6, “Experiments and Results”
  10. Compared to its components PROMODES and PROMODES-H, the f-measure increased by 0.0228 and 0.0353, respectively, on the test set.
    Page 6, “Experiments and Results”
  11. At an optimal decision threshold, however, both yielded a similar f-measure result.
    Page 7, “Conclusions”
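
As mentioned in item 4, here is a minimal sketch of calibrating the decision threshold on a validation set: sweep candidate thresholds, binarise the per-position boundary probabilities, and keep the threshold h* that maximises boundary-level f-measure. The data and the grid of thresholds are hypothetical.

```python
def f_measure(gold, pred):
    """Boundary-level F1 from per-position 0/1 labels."""
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(probs, gold, grid=None):
    """Pick the threshold h* maximising f-measure on validation data.
    probs: per-position boundary probabilities, gold: per-position 0/1 labels."""
    grid = grid or [i / 100 for i in range(1, 100)]
    scored = ((f_measure(gold, [1 if p >= h else 0 for p in probs]), h) for h in grid)
    best_f, best_h = max(scored)
    return best_h, best_f

# Hypothetical validation data:
probs = [0.1, 0.8, 0.3, 0.6, 0.9, 0.2]
gold  = [0,   1,   0,   1,   1,   0  ]
print(calibrate_threshold(probs, gold))
```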

morphological analysis

Appears in 9 sentences as: morphological analysers (1) morphological analysis (8)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis), employs a lower order model, whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model.
    Page 1, “Abstract”
  2. This study is called morphological analysis.
    Page 1, “Introduction”
  3. four tasks are assigned to morphological analysis: word decomposition into morphemes, building morpheme dictionaries, defining morphosyntactical rules which state how morphemes can be combined to valid words, and defining morphophonological rules that specify phonological changes morphemes undergo when they are combined to words.
    Page 1, “Introduction”
  4. Results of morphological analysis are applied in speech synthesis (Sproat, 1996) and recognition (Hirsimaki et al., 2006), machine translation (Amtrup, 2003) and information retrieval (Kettunen, 2009).
    Page 1, “Introduction”
  5. In the past years, there has been a lot of interest and activity in the development of algorithms for morphological analysis.
    Page 1, “Introduction”
  6. If the data for training the model has the same structure as the desired output of the morphological analysis, in other words, if a morphological model is learnt from labelled data, the algorithm is classified under supervised learning.
    Page 1, “Introduction”
  7. Unsupervised algorithms for morphological analysis are Linguistica (Goldsmith, 2001), Morfessor (Creutz, 2006) and Paramor (Monson, 2008).
    Page 1, “Introduction”
  8. We have presented two probabilistic generative models for word decomposition, PROMODES and PROMODES-H. Another generative model for morphological analysis has been described by Snover and Brent (2001) and Snover et al. (2002).
    Page 7, “Related work”
  9. Combining different morphological analysers has been performed, for example, by Atwell and Roberts (2006) and Spiegler et al.
    Page 7, “Related work”

probability distribution

Appears in 6 sentences as: probability distribution (6)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. If a generative model is fully parameterised, it can be reversed to find the underlying word decomposition by forming the conditional probability distribution Pr(Y | X).
    Page 2, “Probabilistic generative model”
  2. The first component of the equation above is the probability distribution over non-/boundaries Pr(b_{j,i}).
    Page 2, “Probabilistic generative model”
  3. We assume that a boundary in position i is inserted independently of other boundaries (zero-order) and of the graphemic representation of the word; it is, however, conditioned on the word length m_j, which means that the probability distribution is in fact Pr(b_{j,i} | m_j).
    Page 2, “Probabilistic generative model”
  4. The second component is the letter transition probability distribution Pr(t_{j,i} | b_{j,i}).
    Page 2, “Probabilistic generative model”
  5. The first component, the probability distribution over non-/boundaries Pr(b_{j,i} | b_{j,i-1}), satisfies Σ_{r=0}^{1} Pr(b_{j,i} = r | b_{j,i-1}) = 1 with b_{j,i-1}, b_{j,i} ∈ {0, 1}.
    Page 3, “Probabilistic generative model”
  6. The second component, the letter transition probability distribution Pr(t_{j,i} | b_{j,i}, b_{j,i-1}), fulfils Σ_{t_{j,i} ∈ A} Pr(t_{j,i} | b_{j,i}, t_{j,i-1}, b_{j,i-1}) = 1, with t_{j,i} being a transition from a certain l_{j,i-1} ∈ A to l_{j,i}.
    Page 3, “Probabilistic generative model”
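
To illustrate how the two components fit together when the model is reversed, here is a toy decoder in the spirit of the zero-order PROMODES model: each position is scored by multiplying a length-conditioned boundary probability with a letter-transition probability, and the higher-scoring hypothesis wins. All probability tables, back-off values and the example word are invented; the paper's actual estimation and decoding details are given in the paper itself.

```python
# Toy zero-order decoder in the spirit of PROMODES: for each position i of a
# word of length m, compare Pr(b_i=1|m)*Pr(t_i|b_i=1) with
# Pr(b_i=0|m)*Pr(t_i|b_i=0). All probability tables below are made up.
P_BOUNDARY_GIVEN_LENGTH = {6: 0.35}           # Pr(b=1 | m); invented value
P_TRANSITION = {                              # Pr(letter transition | b); invented
    1: {("n", "i"): 0.20, ("g", "s"): 0.30},  # transitions often seen across boundaries
    0: {("n", "i"): 0.05, ("g", "s"): 0.02},
}
DEFAULT = {1: 0.01, 0: 0.10}                  # back-off for unseen transitions

def decode(word):
    m = len(word)
    p1 = P_BOUNDARY_GIVEN_LENGTH.get(m, 0.3)
    boundaries = []
    for i in range(1, m):                     # decision points between letters
        trans = (word[i - 1], word[i])
        score_b1 = p1 * P_TRANSITION[1].get(trans, DEFAULT[1])
        score_b0 = (1 - p1) * P_TRANSITION[0].get(trans, DEFAULT[0])
        boundaries.append(1 if score_b1 > score_b0 else 0)
    return boundaries

print(decode("signis"))  # hypothetical word; 1 marks a predicted morpheme boundary
```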

probabilistic models

Appears in 6 sentences as: PRObabilistic Model (1) probabilistic model (1) probabilistic models (4)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. We employ two algorithms which come from a family of generative probabilistic models .
    Page 1, “Abstract”
  2. The purpose of this paper is an analysis of the underlying probabilistic models and the types of errors committed by each one.
    Page 2, “Introduction”
  3. PROMODES stands for PRObabilistic Model for different DEgrees of Supervision.
    Page 2, “Probabilistic generative model”
  4. For this reason, we have extended the model which led to PROMODES-H, a higher-order probabilistic model .
    Page 3, “Probabilistic generative model”
  5. Moreover, our probabilistic models seem to resemble Hidden Markov Models (HMMs) by having certain states and transitions.
    Page 7, “Related work”
  6. We introduced two algorithms for word decomposition which are based on generative probabilistic models .
    Page 7, “Conclusions”

generative process

Appears in 3 sentences as: generative process (3)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. In Section 2 we introduce the probabilistic generative process and show in Sections 2.1 and 2.2 how we incorporate this process in PROMODES and PROMODES-H. We start our experiments by examining the learning behaviour of the algorithms in Section 3.1.
    Page 2, “Introduction”
  2. a probabilistic generative process consisting of words as observed variables X and their hidden segmentation as latent variables Y.
    Page 2, “Probabilistic generative model”
  3. (2002); however, they were interested in finding paradigms as sets of mutually exclusive operations on a word form, whereas we are describing a generative process using morpheme boundaries and resulting letter transitions.
    Page 7, “Related work”

precision and recall

Appears in 3 sentences as: precision and recall (3)
In Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble
  1. It seems that the model gets quickly saturated in terms of incorporating new information and therefore precision and recall do not drastically change for increasing dataset sizes.
    Page 4, “Experiments and Results”
  2. For this reason we broke down the summary measures of precision and recall into their original components: true/false positive (TP/FP) and negative (TN/FN) counts presented in the 2×2 contingency table of Figure 1 (the sketch after this list shows how precision and recall follow from these counts).
    Page 4, “Experiments and Results”
  3. The optimal solution applying h* = 0.38 is more balanced between precision and recall and
    Page 6, “Experiments and Results”
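
A small sketch, with made-up label vectors, of the boundary-level 2×2 contingency table referred to in item 2 and of how precision and recall follow from its counts (true negatives enter neither measure).

```python
from collections import Counter

def contingency_table(gold, pred):
    """2x2 boundary-level contingency table from 0/1 label sequences."""
    counts = Counter(zip(gold, pred))
    return {
        "TP": counts[(1, 1)], "FN": counts[(1, 0)],
        "FP": counts[(0, 1)], "TN": counts[(0, 0)],
    }

# Hypothetical gold and predicted boundary vectors pooled over a few words:
gold = [0, 1, 0, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]
t = contingency_table(gold, pred)
precision = t["TP"] / (t["TP"] + t["FP"])
recall = t["TP"] / (t["TP"] + t["FN"])
print(t, round(precision, 2), round(recall, 2))
```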
