Probabilistic generative model | If a generative model is fully parameterised, it can be reversed to find the underlying word decomposition by forming the conditional probability distribution Pr(Y | X). |
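Reversing a generative model in this way is an application of Bayes' rule: Pr(Y | X) ∝ Pr(X | Y) Pr(Y), normalized over all candidate decompositions Y. A minimal sketch with hypothetical priors and likelihoods (the decomposition candidates and numbers are invented for illustration):

```python
from fractions import Fraction

# Bayes' rule over two hypothetical candidate decompositions Y of the
# observed word X = "walked":
#   Pr(Y | X) = Pr(X | Y) Pr(Y) / sum_y Pr(X | y) Pr(y)
prior = {"walk+ed": Fraction(1, 2), "walke+d": Fraction(1, 2)}          # Pr(Y)
likelihood = {"walk+ed": Fraction(9, 10), "walke+d": Fraction(1, 10)}   # Pr(X | Y)

evidence = sum(likelihood[y] * prior[y] for y in prior)                 # Pr(X)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}     # Pr(Y | X)
# posterior["walk+ed"] == Fraction(9, 10)
```

Using exact fractions makes it easy to see that the posterior is just the renormalized joint.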
Probabilistic generative model | The first component of the equation above is the probability distribution over non-/boundaries Pr(b_i^j). |
Probabilistic generative model | We assume that a boundary at position i is inserted independently of the other boundaries (zero-order) and of the graphemic representation of the word; it is, however, conditioned on the length m_j of the word, which means that the probability distribution is in fact Pr(b_i^j | m_j). |
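Under this zero-order model, the probability of a whole boundary vector factorizes over the word's internal positions, with a single per-position boundary probability that depends only on the word length. A minimal sketch, with hypothetical length-conditioned probabilities:

```python
# Hypothetical table: p_boundary[m] is Pr(b_i = 1 | m), the probability of a
# boundary at any single internal position of a word of length m.
# Zero-order assumption: the m - 1 internal positions are independent.
p_boundary = {4: 0.2, 6: 0.3}

def segmentation_prob(word, boundaries):
    """Probability of a boundary vector for `word`, each b_i drawn
    independently given only the word length m_j."""
    p = p_boundary[len(word)]
    prob = 1.0
    for b in boundaries:            # one indicator per internal position
        prob *= p if b else (1.0 - p)
    return prob

prob = segmentation_prob("walked", [0, 0, 0, 1, 0])   # segmentation "walk|ed"
```

The product has one factor per internal position, which is exactly what "zero-order" buys: no interaction between neighbouring boundary decisions.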
Learning | Given the expected counts, we now need to normalize them to ensure that the transducer represents a conditional probability distribution (Eisner, 2002; Oncina and Sebban, 2006). |
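The normalization step amounts to dividing each expected arc count by the total expected count of its conditioning context, so that arc weights sum to one per (state, input) pair. A minimal sketch with invented counts (the keys and numbers are hypothetical, not from the paper):

```python
from collections import defaultdict

# Hypothetical expected arc counts from an E-step,
# keyed by (state, input_symbol, output_symbol).
expected = {
    ("q0", "a", "x"): 3.0,
    ("q0", "a", "y"): 1.0,
    ("q0", "b", "x"): 2.0,
}

def normalize(counts):
    """Divide each count by the total for its (state, input) context, so the
    transducer's arc weights form a conditional distribution over outputs."""
    totals = defaultdict(float)
    for (state, inp, _), c in counts.items():
        totals[(state, inp)] += c
    return {k: c / totals[(k[0], k[1])] for k, c in counts.items()}

probs = normalize(expected)   # probs[("q0", "a", "x")] == 0.75
```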
Message Approximation | An alternative approach might be to simply treat messages as unnormalized probability distributions, and to minimize the KL divergence between some approximating message and the true message. However, messages are not always probability distributions and, because the number of possible strings is in principle infinite, they need not sum to a finite number.5 Instead, we propose to minimize the KL divergence between the "expected" marginal distribution and the approximated "expected" marginal distribution: |
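The move to normalized "expected" marginals matters because KL divergence is only well defined between proper distributions. A toy sketch (the marginal vectors are hypothetical, standing in for marginals over a finite set of strings):

```python
import math

def kl(p, q):
    """KL(p || q) for two distributions over the same finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# An unnormalized message over strings need not sum to 1 (or even converge),
# so we compare normalized expected marginal distributions instead.
marginal_true = [0.5, 0.3, 0.2]     # hypothetical expected marginal
marginal_approx = [0.4, 0.4, 0.2]   # hypothetical approximation

divergence = kl(marginal_true, marginal_approx)
```

Minimizing this quantity over the parameters of the approximating message recovers the projection step described in the text.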
Message Approximation | The procedure for calculating these statistics is described in Li and Eisner (2009), which amounts to using an expectation semiring (Eisner, 2001) to compute expected transitions in τ ∘ μ* under the probability distribution τ ∘ μ. |
Background | Estimating a conditional probability distribution φ_{ki} = p(· | w_i) as a context profile for each w_i falls into this case. |
Background | When the context profiles are probability distributions, we usually utilize measures on probability distributions such as the Jensen-Shannon (JS) divergence to calculate similarities (Dagan et al., 1994; Dagan et al., 1997). |
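The JS divergence is the symmetrized KL divergence of each distribution against their average; with base-2 logarithms it is bounded in [0, 1]. A minimal sketch over two hypothetical context profiles:

```python
import math

def kl2(p, q):
    """KL(p || q) in bits, over the same finite support."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q against their mixture."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl2(p, m) + 0.5 * kl2(q, m)

# Hypothetical context profiles (distributions over three contexts).
d = js_divergence([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```

Unlike raw KL, JS is symmetric and always finite, which is why it is a natural choice for comparing context profiles.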
Background | The BC is also a similarity measure on probability distributions and is suitable for our purposes as we describe in the next section. |
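The Bhattacharyya coefficient (BC) sums the pointwise geometric means of two distributions; it equals 1 exactly when the distributions coincide and decreases as they diverge. A minimal sketch on the same kind of hypothetical context profiles:

```python
import math

def bhattacharyya(p, q):
    """BC(p, q) = sum_i sqrt(p_i * q_i) for distributions on a finite support."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

sim = bhattacharyya([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])   # similarity in (0, 1]
```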
Background | The C&C supertagger is similar to the Ratnaparkhi (1996) tagger, using features based on words and POS tags in a five-word window surrounding the target word, and defining a local probability distribution over supertags for each word in the sentence, given the previous two supertags. |
Background | Alternatively the Forward-Backward algorithm can be used to efficiently sum over all sequences, giving a probability distribution over supertags for each word which is conditional only on the input sentence. |
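The Forward-Backward computation described above can be sketched in a few lines: forward scores accumulate mass from the left, backward scores from the right, and their product at each position, renormalized, is the per-word distribution over tags given the whole sentence. The local and transition scores below are invented toy numbers, standing in for the tagger's model scores:

```python
def forward_backward(local, trans):
    """local: list over positions of {tag: score};
    trans: {(prev_tag, tag): score}.
    Returns, for each position, a normalized distribution over tags
    conditioned on the entire input."""
    tags = list(local[0])
    n = len(local)
    # Forward pass: mass reaching each tag from the left.
    alpha = [dict(local[0])]
    for i in range(1, n):
        alpha.append({t: local[i][t] * sum(alpha[-1][s] * trans[(s, t)] for s in tags)
                      for t in tags})
    # Backward pass: mass continuing to the right of each tag.
    beta = [{t: 1.0 for t in tags} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(trans[(t, s)] * local[i + 1][s] * beta[i + 1][s] for s in tags)
                   for t in tags}
    # Marginal at each position: alpha * beta, renormalized.
    marginals = []
    for i in range(n):
        unnorm = {t: alpha[i][t] * beta[i][t] for t in tags}
        z = sum(unnorm.values())
        marginals.append({t: v / z for t, v in unnorm.items()})
    return marginals

# Toy two-word sentence with two hypothetical supertags.
local = [{"N": 0.6, "V": 0.4}, {"N": 0.3, "V": 0.7}]
trans = {("N", "N"): 0.7, ("N", "V"): 0.3, ("V", "N"): 0.4, ("V", "V"): 0.6}
marginals = forward_backward(local, trans)
```

This sums over all tag sequences in O(n · |tags|^2) time rather than enumerating them, which is what makes the per-word distributions cheap to obtain.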
Results | Note that these are all alternative methods for estimating the local log-linear probability distributions used by the Ratnaparkhi-style tagger. |