Fitting the Model | We still give EM the full word/tag dictionary, but now we constrain its initial grammar model to the 459 tag bigrams identified by IP. |
Fitting the Model | In addition to removing many bad tag bigrams from the grammar, IP minimization also removes some of the good ones, leading to lower recall (EM = 0.87, IP+EM = 0.57). |
Fitting the Model | During EM training, the smaller grammar, with fewer bad tag bigrams, helps keep the dictionary model from making many of the bad choices that EM made earlier.
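Fitting the Model | To make the constraint concrete, here is a minimal sketch (not the authors' code; `expected_counts`, `allowed`, and `tags` are hypothetical names) of how an HMM tagger's transition model can be restricted to an IP-selected set of tag bigrams inside EM's M-step:

```python
import numpy as np

def constrain_transitions(expected_counts, allowed, tags):
    """Zero out expected counts for tag bigrams outside the IP-selected set,
    then renormalize each row, so EM can never reintroduce a pruned bigram."""
    T = np.zeros((len(tags), len(tags)))
    for i, t1 in enumerate(tags):
        for j, t2 in enumerate(tags):
            if (t1, t2) in allowed:  # e.g., the 459 bigrams identified by IP
                T[i, j] = expected_counts[i, j]
    row_sums = T.sum(axis=1, keepdims=True)
    return np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)
```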
Small Models | That is, there exists a tag sequence that contains 459 distinct tag bigrams, and no other tag sequence contains fewer.
Small Models | We also create variables for every possible tag bigram and word/tag dictionary entry. |
What goes wrong with EM? | We investigate the Viterbi tag sequence generated by EM training and count how many distinct tag bigrams there are in that sequence. |
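What goes wrong with EM? | The diagnostic itself is simple; a small illustration with a toy tag sequence:

```python
def distinct_tag_bigrams(tag_seq):
    """Return the set of distinct adjacent tag pairs in a tag sequence."""
    return {(t1, t2) for t1, t2 in zip(tag_seq, tag_seq[1:])}

viterbi_tags = ["DT", "NN", "VBZ", "DT", "NN"]
print(len(distinct_tag_bigrams(viterbi_tags)))  # 3: (DT,NN), (NN,VBZ), (VBZ,DT)
```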
Experiments | Bigram and trigram performances are similar for Chinese, but trigram performs better for Japanese. |
Experiments | In fact, although the difference in perplexity per character is not so large, the perplexity per word is radically reduced: from 439.8 (bigram) to 190.1 (trigram).
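Experiments | This amplification is expected: for a fixed corpus likelihood, the two perplexities are tied by the average word length. A short added derivation, writing $p$ for the corpus probability, $N_w$ for the word count, and $N_c$ for the character count:

$$\mathrm{PPL}_{\mathrm{word}} = p^{-1/N_w} = \left(p^{-1/N_c}\right)^{N_c/N_w} = \mathrm{PPL}_{\mathrm{char}}^{\,N_c/N_w},$$

so a modest per-character improvement is raised to the power of the average word length when reported per word.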
Inference | Furthermore, it has an inherent limitation: it cannot handle dependencies longer than bigrams, because it uses only local statistics between directly contiguous words for word segmentation.
Inference | It has an additional advantage in that we can accommodate higher-order relationships than bigrams, particularly trigrams, for word segmentation.
Inference | For this purpose, we maintain a forward variable α[t] in the bigram case.
Introduction | Crucially, since they rely on sampling a word boundary between two neighboring words, they can leverage only up to bigram word dependencies. |
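Introduction | A minimal sketch of forward filtering for bigram word segmentation, assuming a word-bigram probability function p_bigram(w, prev) ≈ p(w | prev) (an illustrative name, not the paper's API); to carry bigram dependencies, the forward variable is indexed not only by position t but also by the length k of the last word, as in α[t][k]:

```python
def forward_filter(chars, p_bigram, max_len):
    """alpha[t][k] = prob. of chars[:t] with the final word spanning chars[t-k:t]."""
    n = len(chars)
    alpha = [[0.0] * (max_len + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0  # empty prefix; the previous "word" is the sentence start
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            w = "".join(chars[t - k:t])
            total = 0.0
            # sum over the length j of the word preceding w (j = 0: sentence start)
            for j in range(0, min(max_len, t - k) + 1):
                prev = "".join(chars[t - k - j:t - k]) if j > 0 else "<s>"
                total += p_bigram(w, prev) * alpha[t - k][j]
            alpha[t][k] = total
    return alpha
```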
Pitman-Yor process and n-gram models | The bigram distribution G2 = {p(w | v)} is generated from the unigram distribution G1 = {p(w)} through a Pitman-Yor process.
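Pitman-Yor process and n-gram models | A hedged sketch of the hierarchical construction this refers to, in standard hierarchical Pitman-Yor notation (discount d, strength θ, base distribution G0; the subscripts are ours): each n-gram distribution is drawn from a Pitman-Yor process whose base measure is the distribution one order lower,

$$G_1 \sim \mathrm{PY}(d_1, \theta_1, G_0), \qquad G_2(\cdot \mid v) \sim \mathrm{PY}(d_2, \theta_2, G_1).$$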
A Syntax Free Sequence-oriented Sentence Compression Method |
$$
\mathrm{PLM}(y_{j-1}, y_j) =
\begin{cases}
1 & \text{if } I(y_j) = I(y_{j-1}) + 1 \\
\lambda_{\mathrm{PLM}} \cdot \mathrm{Bigram}\bigl(w_{I(y_j)} \mid w_{I(y_{j-1})}\bigr) & \text{otherwise}
\end{cases}
\tag{5}
$$
A Syntax Free Sequence-oriented Sentence Compression Method | Here, 0 ≤ λ_PLM ≤ 1, and Bigram(·) indicates the word bigram probability.
A Syntax Free Sequence-oriented Sentence Compression Method | The first line of equation (5) agrees with Jing’s observation on sentence alignment tasks (Jing and McKeown, 1999); that is, most (or almost all) bigrams in a compressed sentence appear in the original sentence as they are. |
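A Syntax Free Sequence-oriented Sentence Compression Method | A toy rendering of equation (5) with hypothetical names (idx_prev and idx_cur stand for the source-sentence indices I(y_{j-1}) and I(y_j)); adjacent source words score 1, and any other pair falls back to the damped word-bigram probability:

```python
def plm_score(idx_prev, idx_cur, words, bigram_prob, lam_plm=0.5):
    """Equation (5): reward keeping source-adjacent words adjacent."""
    if idx_cur == idx_prev + 1:  # contiguous in the original sentence
        return 1.0
    # 0 <= lam_plm <= 1 damps the word-bigram probability p(cur | prev)
    return lam_plm * bigram_prob(words[idx_prev], words[idx_cur])
```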
Experimental Evaluation | For example, the label ‘w/o IPTW + Dep’ employs IDF term weighting as one function in equation (1), and word bigram, part-of-speech bigram, and dependency probability between words as the other.
Results and Discussion | Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly. |
Results and Discussion | Most bigrams in a compressed sentence followed those in the source sentence. |
Results and Discussion | PLM is similar to dependency probability in that both features emphasize word pairs that occurred as bigrams in the source sentence. |
Training method | We broadly classify features into two categories: unigram and bigram features. |
Training method | Table 3: Bigram features. TB0: ⟨TE(w−1)⟩; TB1: ⟨TE(w−1), p0⟩; TB2: ⟨TE(w−1), p−1, p0⟩; TB3: ⟨TE(w−1), TB(w0)⟩; TB4: ⟨TE(w−1), TB(w0), p0⟩; TB5: ⟨TE(w−1), p−1, TB(w0)⟩; TB6: ⟨TE(w−1), p−1, TB(w0), p0⟩; CB0: ⟨p−1, p0⟩ otherwise.
Training method | Bigram features: Table 3 shows our bigram features. |
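Training method | An illustrative sketch (not the paper's code) of instantiating the Table 3 templates, where te_prev and tb_cur stand for TE(w−1) and TB(w0), the tags of the previous word's end character and the current word's beginning character, and p_prev, p_cur for the POS tags p−1, p0:

```python
def bigram_features(te_prev, tb_cur, p_prev, p_cur):
    """Instantiate the TB0-TB6 bigram feature templates for one word pair."""
    return [
        ("TB0", te_prev),
        ("TB1", te_prev, p_cur),
        ("TB2", te_prev, p_prev, p_cur),
        ("TB3", te_prev, tb_cur),
        ("TB4", te_prev, tb_cur, p_cur),
        ("TB5", te_prev, p_prev, tb_cur),
        ("TB6", te_prev, p_prev, tb_cur, p_cur),
    ]
```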
Experiments | Bigram-based reordering.
Experiments | First, we consider a bigram language model, and the algorithms try to find the reordering that maximizes the LM score.
Experiments | This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than that of the reference.
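Experiments | A brute-force illustration of this claim; the log-probability function passed in is an assumption, not the paper's model, and the factorial search is feasible only for short sentences:

```python
from itertools import permutations

def lm_score(sent, logp_bigram):
    """Bigram LM log-score of a word tuple, with a start-of-sentence marker."""
    return sum(logp_bigram(a, b) for a, b in zip(("<s>",) + sent, sent))

def best_reordering(words, logp_bigram):
    """Exhaustively find the permutation with the highest bigram LM score;
    it can outscore the original (reference) word order."""
    return max(permutations(words), key=lambda s: lm_score(s, logp_bigram))
```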
Phrase-based Decoding as TSP | • The language model cost of producing the target words of b′ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b′.
Phrase-based Decoding as TSP | This restriction to bigram models will be removed in Section 4.1. |
Phrase-based Decoding as TSP | 4.1 From Bigram to N-gram LM |
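Phrase-based Decoding as TSP | A hedged sketch of the precomputation mentioned above (the names are illustrative): under a bigram LM, the cost of gluing phrase b′ after phrase b depends only on b's last target word and b′'s target words, so it can be cached for every phrase pair:

```python
def transition_lm_cost(b_tgt, bp_tgt, logp_bigram):
    """Negative bigram-LM log-score of emitting phrase bp_tgt right after b_tgt."""
    cost, prev = 0.0, b_tgt[-1]  # only b's final target word matters
    for w in bp_tgt:
        cost -= logp_bigram(prev, w)
        prev = w
    return cost
```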
Experimental Results | Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs. |
Experimental Results | This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model. |
Experimental Results | Indeed, although Table 3 shows that better approximations can be obtained by using higher-order models, the best BLEU score in Tables 2a and 2c was obtained by the bigram model. |
Variational Approximate Decoding | However, because q* only approximates p, the y* of (13) may be locally appropriate but globally inadequate as a translation of x. Observe, e.g., that an n-gram model q* will tend to favor short strings y, regardless of the length of x. Suppose x = le chat chasse la souris (“the cat chases the mouse”) and q* is a bigram approximation to p(y | x). Presumably q*(the | START), q*(mouse | the), and q*(END | mouse) are all large under q*. So the most probable string y* under q* may be simply “the mouse,” which is short and has a high probability but fails to cover x.
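Variational Approximate Decoding | A toy numeric check of this short-string bias; all probabilities below are invented for illustration, not taken from any trained model:

```python
q = {("START", "the"): 0.6, ("the", "mouse"): 0.3, ("mouse", "END"): 0.5,
     ("the", "cat"): 0.3, ("cat", "chases"): 0.4, ("chases", "the"): 0.5}

def q_prob(words):
    """Probability of a string under the toy bigram model q."""
    p, prev = 1.0, "START"
    for w in words + ["END"]:
        p *= q.get((prev, w), 1e-6)
        prev = w
    return p

print(q_prob(["the", "mouse"]))                          # 0.09
print(q_prob(["the", "cat", "chases", "the", "mouse"]))  # 0.0054: longer loses
```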
Experiments | Phrase extraction and indexing: We evaluate our proposed method on two different types of phrases: syntactic head-modifier pairs (syntactic phrases) and simple bigram phrases (statistical phrases). |
Experiments | Since our method is not limited to a particular type of phrase, we have also conducted experiments on statistical phrases (bigrams) with a reduced set of directly applicable features: RMO, RSO, PD5, DF, and CPP; the features requiring linguistic preprocessing (e.g., PPT) are not used, because it is unrealistic to use them in a bigram-based retrieval setting.
Experiments | In most cases, the distance between words in a bigram is 1, but sometimes it can be more than 1 because of the effect of stopword removal.
Computing Feature Expectations | Hyper-edges (boxes) are annotated with normalized transition probabilities, as well as the bigrams produced by each rule application. |
Computing Feature Expectations | The expected count of the bigram “man with” is the sum of posterior probabilities of the two hyper-edges that produce it. |
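Computing Feature Expectations | A minimal sketch of that expectation; `edges` pairs each hyper-edge's posterior probability with the bigrams its rule application produces (this data layout is an assumption for illustration):

```python
from collections import defaultdict

def expected_bigram_counts(edges):
    """Expected count of each bigram = sum of posteriors of edges producing it."""
    counts = defaultdict(float)
    for posterior, bigrams in edges:
        for bg in bigrams:
            counts[bg] += posterior
    return counts

# e.g., two hyper-edges that both produce "man with":
edges = [(0.4, [("man", "with")]), (0.25, [("man", "with")])]
print(expected_bigram_counts(edges)[("man", "with")])  # 0.65
```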
Computing Feature Expectations | For example, decoding under a variational approximation to the model’s posterior that decomposes over bigram probabilities is equivalent to fast consensus decoding with