Computing Feature Expectations | Unigrams are produced by lexical rules, while higher-order n-grams can be produced either directly by lexical rules or by combining constituents.
Computing Feature Expectations | The n-gram language model score of e similarly decomposes over the h in e that produce n-grams.
Computing Feature Expectations | The linear similarity measure takes the following form, where T_n is the set of n-grams:
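Computing Feature Expectations | The formula itself does not survive in this excerpt. A hedged reconstruction, assuming per-order weights θ_n, an indicator δ(e, t) for whether n-gram t occurs in e, and a count c(e', t) of t in e' (these symbols are assumptions, not necessarily the paper's exact notation), would be

\ell(e, e') = \sum_{n=1}^{N} \theta_n \sum_{t \in T_n} \delta(e, t)\, c(e', t)

Under such a linear form, the expected similarity needed for consensus decoding depends only on expected n-gram counts of the kind E[c(e, t)] discussed below, which is why computing feature expectations suffices.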
Experimental Results | Above, we compare the precision, relative to reference translations, of sets of n-grams chosen in two ways. |
Experimental Results | The left bar is the precision of the n-grams in e*.
Experimental Results | The right bar is the precision of n-grams with E[c(e, t)] > ρ. To justify this comparison, we chose ρ so that both methods of choosing n-grams gave the same n-gram recall: the fraction of n-grams in reference translations that also appeared in e* or had E[c(e, t)] > ρ.
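Experimental Results | As a rough illustration of this selection rule (the expected counts, the threshold, and the reference sentence below are made-up, not the paper's data), one can threshold expected n-gram counts and measure type-level precision and recall against reference n-grams:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision_recall(selected, reference):
    """Type-level precision/recall of a set of n-grams against reference n-grams."""
    selected, reference = set(selected), set(reference)
    hits = selected & reference
    prec = len(hits) / len(selected) if selected else 0.0
    rec = len(hits) / len(reference) if reference else 0.0
    return prec, rec

# Hypothetical expected counts E[c(e, t)] over a forest, and a threshold rho
# chosen so that recall matches that of the n-grams in e*.
expected_counts = {("the", "house"): 0.92, ("house", "is"): 0.41, ("red", "dog"): 0.07}
rho = 0.4
selected = [t for t, c in expected_counts.items() if c > rho]
reference = ngrams("the house is big".split(), 2)
print(precision_recall(selected, reference))
```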
MERT for MBR Parameter Optimization | This linear function contains N + 1 parameters θ_0, θ_1, ..., θ_N, where N is the maximum order of the n-grams involved.
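MERT for MBR Parameter Optimization | The linear function itself is not shown in this excerpt; a plausible form, assumed here from the standard linear approximation to BLEU used in lattice MBR work, is

G(e, e') = \theta_0 \,|e'| + \sum_{n=1}^{N} \theta_n \, \#_n(e, e')

where #_n(e, e') denotes the number of order-n n-gram matches; the N + 1 parameters θ_0, θ_1, ..., θ_N are then tuned with MERT.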
Minimum Bayes-Risk Decoding | First, the set of n-grams is extracted from the lattice. |
Minimum Bayes-Risk Decoding | For a moderately large lattice, there can be several thousand n-grams, and the procedure becomes expensive.
Minimum Bayes-Risk Decoding | For each node t in the lattice, we maintain a quantity Score(w, t) for each n-gram w that lies on a path from the source node to t. Score(w, t) is the highest posterior probability among all edges on the paths that terminate on t and contain n-gram w. The forward pass requires computing the n-grams introduced by each edge; to do this, we propagate n-grams (up to maximum order - 1) terminating on each node.
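Minimum Bayes-Risk Decoding | A minimal sketch of this forward pass, assuming the lattice is given as topologically ordered node ids with incoming edges of the form (start_node, words, edge_posterior); these structures and the toy lattice are illustrative, not the paper's exact representation:

```python
from collections import defaultdict

def ngram_scores(nodes, in_edges, max_order):
    """Score(w, t): highest edge posterior seen for n-gram w on paths into node t."""
    score = {t: defaultdict(float) for t in nodes}
    context = defaultdict(set)  # suffixes (up to max_order - 1 words) of paths ending at each node
    for t in nodes:  # nodes must be in topological order
        if not in_edges.get(t):
            context[t].add(())  # source node: empty history
        for (s, words, post) in in_edges.get(t, []):
            # n-grams already seen on paths into s also lie on paths into t
            for w, sc in score[s].items():
                if sc > score[t][w]:
                    score[t][w] = sc
            for suffix in context[s]:
                seq = suffix + tuple(words)
                # record the n-grams introduced by this edge (those ending on it)
                for n in range(1, max_order + 1):
                    for i in range(max(0, len(suffix) - n + 1), len(seq) - n + 1):
                        w = seq[i:i + n]
                        if post > score[t][w]:
                            score[t][w] = post
                # propagate the path suffix needed to extend n-grams across later edges
                keep = max_order - 1
                context[t].add(seq[len(seq) - keep:] if keep else ())
    return score

# Toy lattice with invented posteriors.
nodes = [0, 1, 2]
in_edges = {1: [(0, ["the", "house"], 0.7)],
            2: [(1, ["is"], 0.7), (0, ["the", "home", "is"], 0.3)]}
print(dict(ngram_scores(nodes, in_edges, 2)[2]))
```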
Collaborative Decoding | Here we do not discriminate among different lexical n-grams and are only concerned with aggregating statistics over all n-grams of the same order.
Collaborative Decoding | G_n^+(e, e') is the n-gram agreement measure function, which counts the number of occurrences in e' of n-grams in e.
Collaborative Decoding | So the corresponding feature value will be the expected number of occurrences in H_k(f) of all n-grams in e:
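Collaborative Decoding | A rough sketch of this feature, assuming the other decoder's hypothesis set H_k(f) is available as (tokens, posterior) pairs; the names, the choice to count each distinct n-gram of e once, and the toy posteriors are illustrative assumptions, not the paper's API:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def expected_agreement(e_tokens, hyp_set, n):
    """Expected number of occurrences in H_k(f) of the order-n n-grams of e."""
    e_ngrams = set(ngram_counts(e_tokens, n))  # distinct n-grams of e
    total = 0.0
    for h_tokens, posterior in hyp_set:
        h_counts = ngram_counts(h_tokens, n)
        total += posterior * sum(h_counts[g] for g in e_ngrams)
    return total

# Toy hypothesis set with made-up posteriors.
hyps = [("the house is blue".split(), 0.6), ("the home is blue".split(), 0.4)]
print(expected_agreement("the house is blue".split(), hyps, 2))
```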
Discussion | Our method uses n-gram agreement information, and consensus features are integrated into the decoding models.
Experiments | In Table 5 we show, from another angle, the impact of consensus-based features by restricting the maximum order of the n-grams used to compute agreement statistics.
Experiments | One reason could be that the data sparsity of high-order n-grams leads to overfitting on the development data.
Variational Approximate Decoding | whose edges correspond to n-grams (weighted with negative log-probabilities) and whose vertices correspond to (n-1)-grams.
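Variational Approximate Decoding | As a small illustration of this graph view for a trigram model (the probabilities below are invented purely for illustration): each edge is a trigram weighted by its negative log-probability, and its two endpoints are the bigrams it connects.

```python
import math

# Made-up trigram probabilities, illustrative only.
trigram_probs = {
    ("the", "blue", "house"): 0.20,
    ("blue", "house", "is"): 0.50,
}

edges = []
for (w1, w2, w3), p in trigram_probs.items():
    src, dst = (w1, w2), (w2, w3)            # vertices: (n-1)-grams
    edges.append((src, dst, -math.log(p)))   # edge weight: negative log-probability
print(edges)
```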
Variational Approximate Decoding | This may be regarded as favoring n-grams that are likely to appear in the reference translation (because they are likely in the derivation forest). |
Variational Approximate Decoding | However, in order to score well on the BLEU metric for MT evaluation (Papineni et al., 2001), which gives partial credit, we would also like to favor lower-order n-grams that are likely to appear in the reference, even if this means picking some less-likely high-order n-grams.
Variational vs. Min-Risk Decoding | Now, let us divide N, which contains n-gram types of different n, into several subsets N_n, each of which contains only the n-grams with a given length n. We can now rewrite (19) as follows,
Introduction | [Figure residue; recoverable legend entries: Monte Carlo using N-grams, Monte Carlo using words, Relative Entropy using N-grams, Relative Entropy using words; axis label: Rank.]
Related Work | The n-gram-based approaches rely on the counts of character or byte n-grams, which are sequences of n characters or bytes, extracted from a corpus for each reference language.
Related Work | Dunning (1994) proposed a system that uses Markov chains of byte n-grams with Bayesian decision rules to minimize the probability of error.
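Related Work | A toy sketch in the spirit of these n-gram-based identifiers (add-one-smoothed character trigram scores with an argmax decision); this is an illustration under those assumptions, not Dunning's exact Markov-chain formulation, and the training strings are invented:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(corpora, n=3):
    """corpora: {language: training text} -> per-language n-gram counts."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpora.items()}

def identify(text, models, n=3):
    best_lang, best_score = None, float("-inf")
    for lang, counts in models.items():
        total, vocab = sum(counts.values()), len(counts) + 1
        # add-one smoothed log-probability of the text's n-grams under this language
        score = sum(math.log((counts[g] + 1) / (total + vocab)) for g in char_ngrams(text, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

models = train({"en": "the quick brown fox jumps over the lazy dog",
                "de": "der schnelle braune fuchs springt ueber den faulen hund"})
print(identify("the brown dog", models))
```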
Introduction | Miller et al. (1999) and Song and Croft (1999) explore the use of n-grams in retrieval models.
Previous Work | Subsequently, various types of phrases, such as sequential n-grams (Mitra et al., 1997), head-modifier pairs extracted from syntactic structures (Lewis and Croft, 1990; Zhai, 1997; Dillon and Gray, 1983; Strzalkowski et al., 1994), and proximity-based phrases (Turpin and Moffat, 1999), were examined with conventional retrieval models (e.g., the vector space model).
Previous Work | Song and Croft (1999), Miller et al. (1999), Gao et al. (2004), and Metzler and Croft (2005) investigated the effectiveness of the language modeling approach in modeling statistical phrases such as n-grams or proximity-based phrases.