Abstract | However, n-gram based metrics are still the dominant approach today.
Alternatives to Correlation-based Meta-evaluation | Therefore, we have focused on other aspects of metric reliability that have revealed differences between n-gram and linguistic based metrics: |
Alternatives to Correlation-based Meta-evaluation | All n-gram based metrics achieve SIP and SIR values between 0.8 and 0.9. |
Alternatives to Correlation-based Meta-evaluation | This result suggests that n-gram based metrics are reasonably reliable for this purpose. |
Correlation with Human Judgements | Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics. |
Correlation with Human Judgements | Linguistic metrics are represented by grey plots, and black plots represent metrics based on n-gram overlap. |
Correlation with Human Judgements | Therefore, we need additional meta-evaluation criteria in order to clarify the behavior of linguistic metrics as compared to n-gram based metrics. |
Introduction | However, the most commonly used metrics are still based on n-gram matching. |
Introduction | For that purpose, we compare — using four different test beds — the performance of 16 n-gram based metrics, 48 linguistic metrics and one combined metric from the state of the art. |
Previous Work on Machine Translation Meta-Evaluation | All of those metrics were based on n-gram overlap.
Composite language model | The n-gram language model is essentially a word predictor that, given the document history, predicts the next word w_{k+1} based on the last n − 1 words with probability p(w_{k+1} | w_{k−n+2}^{k}), where w_{k−n+2}^{k} = w_{k−n+2}, ..., w_{k}.
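That conditional probability can be sketched with plain maximum-likelihood counts (a toy illustration, not the composite model's actual estimator; `train_ngram` and `prob` are our own helper names):

```python
from collections import Counter

def train_ngram(tokens, n):
    """Count n-gram and (n-1)-gram (history) occurrences for MLE estimation."""
    counts = Counter()
    history_counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] += 1
        history_counts[gram[:-1]] += 1
    return counts, history_counts

def prob(counts, history_counts, history, word):
    """MLE estimate of p(word | last n-1 words); zero for unseen histories."""
    h = tuple(history)
    if history_counts[h] == 0:
        return 0.0
    return counts[h + (word,)] / history_counts[h]

corpus = "the cat sat on the mat the cat ran".split()
counts, hist = train_ngram(corpus, 2)
# p(cat | the) = 2/3: "the" starts three bigrams, two of them "the cat"
```

Real n-gram LMs replace the raw MLE ratio with a smoothed estimate, but the history/count bookkeeping is the same.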
Composite language model | The SLM (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000) uses syntactic information beyond the regular n-gram models to capture sentence level long range dependencies.
Composite language model | When combining n-gram, m-order SLM and
Experimental results | The m-SLM performs competitively with its counterpart n-gram (n=m+1) on large scale corpus. |
Introduction | The Markov chain (n-gram) source models, which predict each word on the basis of the previous n − 1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings.
Introduction | While efficient at encoding local word interactions, the n-gram model clearly ignores the rich syntactic and semantic structures that constrain natural languages.
Introduction | (2006) integrated n-gram, structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
Training algorithm | Similar to SLM (Chelba and Jelinek, 2000), we adopt an N-best list approximate EM re-estimation with modular modifications to seamlessly incorporate the effect of n-gram and PLSA components.
Training algorithm | This implies that when computing the language model probability of a sentence in a client, all servers need to be contacted for each n-gram request. |
Training algorithm | (2007) follows a standard MapReduce paradigm (Dean and Ghemawat, 2004): the corpus is first divided and loaded into a number of clients, and n-gram counts are collected at each client; the n-gram counts are then mapped and stored in a number of servers, resulting in exactly one server being contacted per n-gram when computing the language model probability of a sentence.
Introduction | Another is a web-scale N-gram corpus with N-grams of length 1-5 (Brants and Franz, 2006), which we call Google V1 in this paper.
Web-Derived Selectional Preference Features | One is the web, where N-gram counts are approximated by Google hits.
Web-Derived Selectional Preference Features | This N-gram corpus records how often each unique sequence of words occurs. |
Web-Derived Selectional Preference Features | times or more (1 in 25 billion) are kept, and appear in the n-gram tables. |
Abstract | Our model can also be viewed as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any “word” indications.
Experiments | NPY(n) means n-gram NPYLM. |
Inference | character n-gram model explicit as Θ, we can set
Inference | where p(c1 · · · ck, k | Θ) is an n-gram probability given by (6), and p(k | Θ) is the probability that a word of length k will be generated from Θ.
Introduction | In this paper, we extend this work to propose a more efficient and accurate unsupervised word segmentation that will optimize the performance of the word n-gram Pitman-Yor (i.e. |
Introduction | Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language. |
Introduction | By embedding a character n-gram model within a word n-gram model from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model.
Nested Pitman-Yor Language Model | In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM. |
Nested Pitman-Yor Language Model | To avoid dependency on the n-gram order n, we actually used the ∞-gram language model (Mochihashi and Sumita, 2007), a variable-order HPYLM, for characters.
Pitman-Yor process and n-gram models | In this representation, each n-gram context h (including the null context ε for unigrams) is a Chinese restaurant whose customers are the n-gram counts seated over the tables 1 · · · t_hw.
Pitman-Yor process and n-gram models | As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
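The recursion referenced above, written in the standard notation of the hierarchical Pitman-Yor language model (a sketch following Teh (2006), where d and θ are the discount and strength parameters, c the customer counts, t the table counts, and h' the backoff context obtained by dropping the earliest word of h; the paper's own symbols may differ):

```latex
p(w \mid h) \;=\; \frac{c_{hw} - d\, t_{hw}}{\theta + c_{h\cdot}}
\;+\; \frac{\theta + d\, t_{h\cdot}}{\theta + c_{h\cdot}}\; p(w \mid h')
```

with c_{h·} = Σ_w c_{hw} and t_{h·} = Σ_w t_{hw}, so the model interpolates observed counts with the shorter-context distribution p(w | h').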
Abstract | Our particular variational distributions are parameterized as n-gram models. |
Abstract | We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). |
Introduction | In practice, we approximate with several different variational families Q, corresponding to n-gram (Markov) models of different orders. |
Variational Approximate Decoding | Since each q(y) is a distribution over output strings, a natural choice for Q is the family of n-gram models. |
Variational Approximate Decoding | where W is a set of n-gram types. |
Variational Approximate Decoding | Each w ∈ W is an n-gram, which occurs c_w(y) times in the string y, and w may be divided into an (n − 1)-gram prefix (the history) and a 1-gram suffix r(w) (the rightmost or current word).
Abstract | Using an iterative decoding approach, n-gram agreement statistics between translations of multiple decoders are employed to re-rank both full and partial hypotheses explored in decoding.
Collaborative Decoding | To compute the consensus measures, we further decompose each one into n-gram matching statistics between e and e'.
Collaborative Decoding | For each n-gram order n, we introduce a pair of complementary consensus measure functions Gn+(e, e') and Gn−(e, e'), described as follows:
Collaborative Decoding | Gn+(e, e') is the n-gram agreement measure function, which counts the number of occurrences in e' of n-grams in e.
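As a sketch, Gn+ can be implemented by counting how often n-grams of e occur in e' (illustrative helper names; the paper's exact definition may clip or normalize the counts differently):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def agreement(e, e_prime, n):
    """Count occurrences in e' of n-grams that also appear in e (Gn+ sketch)."""
    grams_e = ngrams(e, n)
    grams_ep = ngrams(e_prime, n)
    return sum(c for g, c in grams_ep.items() if g in grams_e)
```

A disagreement measure Gn− can then count the remaining n-grams of e' that never appear in e.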
Experiments | All baseline decoders are extended with n-gram consensus-based co-decoding features to construct member decoders.
Experiments | Table 4 shows the comparison results of a two-system co-decoding using different settings of n-gram agreement and disagreement features. |
Experiments | It is clearly shown that both n-gram agreement and disagreement types of features are helpful, and using them together is the best choice. |
Introduction | Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Minimum Bayes-Risk Decoding | They approximated log(BLEU) score by a linear function of n-gram matches and candidate length. |
Minimum Bayes-Risk Decoding | where w is an n-gram present in either E or E', and θ0, θ1, ..., θN are weights which are determined empirically, where N is the maximum n-gram order.
Minimum Bayes-Risk Decoding | Next, the posterior probability of each n-gram is computed. |
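From a k-best list, that posterior can be sketched as the total posterior mass of the hypotheses containing each n-gram (an illustrative reading; the exact lattice-based estimator in the paper may differ):

```python
from collections import defaultdict

def ngram_posteriors(hypotheses, n):
    """Posterior of each n-gram: summed posterior probability of hypotheses
    that contain it at least once. `hypotheses` is a list of
    (tokens, posterior_probability) pairs (names are ours)."""
    post = defaultdict(float)
    for tokens, p in hypotheses:
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for g in seen:
            post[g] += p
    return dict(post)
```

These posteriors then weight the n-gram match terms in the linearized BLEU gain.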
Introduction | Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models. |
Introduction | Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output. |
Related Work | (2007) use supertag n-gram LMs. |
Related Work | An n-gram language model history is also maintained at each node in the translation lattice. |
Related Work | The search space is further trimmed with hypothesis recombination, which collapses lattice nodes that share a common coverage vector and n-gram state. |
Abstract | Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. |
Discussion and Conclusion | Our evaluations show that such an approach yields summaries which score more highly than an approach which uses a simpler representation of an object type model in the form of an n-gram language model.
Introduction | a corpus of descriptions of churches, a corpus of bridge descriptions, and so on) and reported results showing that incorporating such n-gram language models as a feature in a feature-based extractive summarizer improves the quality of automatically generated summaries. |
Introduction | The main weakness of n-gram language models is that they only capture very local information about short term sequences and cannot model long-distance dependencies between terms.
Introduction | If this information is expressed as in the first line of Table 1, n-gram language models are likely to
Representing conceptual models 2.1 Object type corpora | We derive n-gram language and dependency pattern models using object type corpora made available to us by Aker and Gaizauskas. |
Representing conceptual models 2.1 Object type corpora | 2.2 N-gram language models |
Representing conceptual models 2.1 Object type corpora | they calculate the probability that a sentence is generated based on an n-gram language model.
Summarizer | • LMSim3: The similarity of a sentence S to an n-gram language model LM (the probability that the sentence S is generated by LM).
Decoding with the NNJM | Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder.
Decoding with the NNJM | When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space.
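A minimal sketch of that per-constituent n-gram LM state (hypothetical helper; real decoders also handle sentence-boundary and elided-middle markers):

```python
def lm_state(tokens, n):
    """LM state of a partial constituent: its leftmost and rightmost n-1
    words, which are all a parent constituent needs in order to score the
    boundary n-grams formed when children are concatenated."""
    k = n - 1
    if len(tokens) <= k:
        # Short spans: the whole span is both boundary contexts.
        return (tuple(tokens), tuple(tokens))
    return (tuple(tokens[:k]), tuple(tokens[-k:]))
```

Hypotheses sharing the same state (and coverage) can then be recombined, since no future n-gram score can distinguish them.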
Decoding with the NNJM | We also train a separate lower-order n-gram model, which is necessary to compute estimate scores during hierarchical decoding.
Introduction | Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010). |
Introduction | Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations | Specifically, this means that we don’t use dependency-based rule extraction, and our decoder only contains the following MT features: (1) rule probabilities, (2) n-gram Kneser-Ney LM, (3) lexical smoothing, (4) target word count, (5) concat rule penalty.
Model Variations | This does not include the cost of n-gram creation or cached lookups, which amount to ~0.03 seconds per source word in our current implementation.14 However, the n-grams created for the NNJM can be shared with the Kneser-Ney LM, which reduces the cost of that feature.
Model Variations | 14In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates. |
Neural Network Joint Model (NNJM) | Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word ti is conditioned on the previous n − 1 target words.
Neural Network Joint Model (NNJM) | If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score.6 Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for the self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.7
Neural Network Joint Model (NNJM) | “lookups/sec” is the number of unique n-gram probabilities that can be computed per second.
Introduction | One is the n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction | (2008) develop a general-purpose realizer couched in the framework of Lexical Functional Grammar based on simple n-gram models. |
Introduction | (2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers then uses an n-gram language model to order siblings. |
Log-linear Models | We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words. |
Log-linear Models | The dependency relation model calculates the probability of dependency relation n-gram P(DR) according to Eq.(3). |
Log-linear Models | = ∏_{k=1}^{K} P(DR_k | DR_{k−n+1}, …, DR_{k−1}). Word Model: We integrate an n-gram word model into the log-linear model for capturing the relation between adjacent words.
Computing Feature Expectations | In this section, we consider BLEU in particular, for which the relevant features φ(e) are n-gram counts up to length n = 4.
Computing Feature Expectations | The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores.3 Each hyper-edge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations | Each n-gram that appears in a translation e is associated with some h in its derivation: the h corresponding to the rule that produces the n-gram.
Consensus Decoding Algorithms | In this expression, BLEU(e; e') references e' only via its n-gram count features c(e', t).2 2The length penalty is also a function of n-gram counts.
Introduction | The contributions of this paper include a linear-time algorithm for MBR using linear similarities, a linear-time alternative to MBR using nonlinear similarity measures, and a forest-based extension to this procedure for similarities based on n-gram counts. |
Abstract | We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context. |
Abstract | We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours. |
Introduction | At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies. |
Introduction | Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark, |
Introduction | Our model can be trained simply by collecting counts and using the same smoothing techniques normally applied to n-gram models (Kneser and Ney, 1995), enabling us to apply techniques developed for scaling n-gram models out of the box (Brants et al., 2007; Pauls and Klein, 2011).
Treelet Language Modeling | The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
Treelet Language Modeling | As in the n-gram case, we would like to pick h to be large enough to capture relevant dependencies, but small enough that we can obtain meaningful estimates from data. |
Treelet Language Modeling | Although it is tempting to think that we can replace the left-to-right generation of n-gram models with the purely top-down generation of typical PCFGs, in practice, words are often highly predictive of the words that follow them — indeed, n-gram models would be terrible language models if this were not the case. |
Language Identification | We implement a statistical model using a character based n-gram feature. |
Language Identification | For each language, we collect the n-gram counts (for n = 1 to n = 7, also using the word beginning and ending spaces) from the vocabulary of the training corpus, and then generate a probability distribution from these counts.
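A toy sketch of such a character-n-gram profile and a scoring rule (assumptions: n up to 3 rather than 7, add-one smoothing, and a simple unigram-of-n-grams scorer; the names `char_ngram_profile` and `score` are ours, not the paper's):

```python
from collections import Counter
import math

def char_ngram_profile(text, n_max=3):
    """Character n-gram counts (n = 1..n_max) over a space-padded string,
    so word-boundary n-grams are captured as described above."""
    padded = f" {text} "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

def score(text, profile):
    """Log-probability-style score of text under a language profile, with
    add-one smoothing (an illustrative scoring rule, not the paper's)."""
    total = sum(profile.values())
    test = char_ngram_profile(text)
    return sum(c * math.log((profile[g] + 1) / (total + len(test)))
               for g, c in test.items())
```

Given one profile per language, the input is assigned to the language whose profile yields the highest score.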
Language Identification | In other words, this time we used a word-based n-gram method, only with n = 1. |
Related Work | Most of the work carried out to date on the written language identification problem consists of supervised approaches that are trained on a list of words or n-gram models for each reference language. |
Related Work | The n-gram based approaches are based on the counts of character or byte n-grams, which are sequences of n characters or bytes, extracted from a corpus for each reference language. |
Related Work | classification models that use the n-gram features have been proposed. |
Abstract | We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model. |
Decoding to Maximize BLEU | BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU. |
Introduction | This complexity arises from the interaction of the tree-based translation model with an n-gram language model. |
Language Model Integrated Decoding for SCFG | We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space. |
Language Model Integrated Decoding for SCFG | Without an n-gram language model, decoding using SCFG is not much different from CFG parsing. |
Language Model Integrated Decoding for SCFG | However, when we want to integrate an n-gram language model into the search, our goal is searching for the derivation whose total sum of weights of productions and n-gram log probabilities is maximized. |
Multi-pass LM-Integrated Decoding | We take the same view as in speech recognition that a trigram-integrated model is a finer-grained model than a bigram model, and in general we can do an (n − 1)-gram decoding as a predictive pass for the following n-gram pass.
Multi-pass LM-Integrated Decoding | We can make the approximation even closer by taking local higher-order outside n-gram information for a state X[i, j, u_{1···n−1}, v_{1···n−1}] into account.
Multi-pass LM-Integrated Decoding | where β is the Viterbi inside cost and α is the Viterbi outside cost, to globally prioritize the n-gram integrated states on the agenda for exploration.
Abstract | We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of the well-known Kneser-Ney (1995) smoothing.
Abstract | We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. |
Introduction | Smoothed n-gram language models are the de facto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction | Such models are trained on large text corpora, by counting the frequency of n-gram collocations, then normalizing and smoothing (regularizing) the resulting multinomial distributions. |
Introduction | Briefly, the smoothing method reestimates lower-order n-gram parameters in order to avoid overestimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams.
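The idea behind that lower-order reestimation can be sketched with Kneser-Ney-style continuation counts, where a lower-order statistic counts distinct contexts rather than raw occurrences (an illustrative fragment, not the paper's algorithm):

```python
def continuation_counts(tokens):
    """Kneser-Ney-style unigram statistic: for each word, count the number of
    DISTINCT preceding words it follows, rather than its raw frequency. A word
    that is frequent but only ever follows one context (e.g. "francisco"
    after "san") gets a small continuation count."""
    contexts = {}
    for prev, word in zip(tokens, tokens[1:]):
        contexts.setdefault(word, set()).add(prev)
    return {w: len(cs) for w, cs in contexts.items()}
```

Normalizing these counts gives the lower-order distribution used when a higher-order n-gram is unseen.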
Preliminaries | N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
Preliminaries | where c(hw) is the count of the n-gram sequence hw.
Preliminaries | N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored. |
Background | The computation of ψ(e, H(v)) is based on a linear combination of a set of n-gram consensus-based features.
Background | For each order of n-gram, hn+(e, H(v)) and hn−(e, H(v)) are defined to measure the n-gram agreement and disagreement between e and other translation candidates in H(v), respectively.
Background | If p orders of n-gram are used in computing ψ(e, H(v)), the total number of features in the system combination will be T + 2 × p (T model-score-based features defined in Equation 8 and 2 × p consensus-based features defined in Equation 9).
Abstract | The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy-pruning. |
Introduction | LMs are usually implemented as n-gram models parameterized for each distinct sequence of up to n words observed in the training corpus. |
Introduction | Recent work (Talbot and Osborne, 2007b) has demonstrated that randomized encodings can be used to represent n-gram counts for LMs with significant space savings, circumventing information-theoretic constraints on lossless data structures by allowing errors with some small probability.
Introduction | Lookup is very efficient: the values of 3 cells in a large array are combined with the fingerprint of an n-gram.
Perfect Hash-based Language Models | The model can erroneously return a value for an n-gram that was never actually stored, but will always return the correct value for an n-gram that is in the model. |
Perfect Hash-based Language Models | Each n-gram x_i is drawn from some set of possible n-grams U and its associated value from a corresponding set of possible values V.
Perfect Hash-based Language Models | We do not store the n-grams and their probabilities directly but rather encode a fingerprint f(x_i) of each n-gram together with its associated value in such a way that the value can be retrieved when the model is queried with the n-gram x_i.
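A simplified sketch of the fingerprint idea, using a plain dict instead of the 3-cell array encoding (class and helper names are ours): stored n-grams always get their correct value back, while unseen n-grams may, with small probability, collide with a stored fingerprint and receive a spurious value, matching the error behavior described above.

```python
import hashlib

def fingerprint(ngram, bits=32):
    """Short deterministic fingerprint of an n-gram (md5 truncated to `bits`)."""
    digest = hashlib.md5(" ".join(ngram).encode("utf-8")).hexdigest()
    return int(digest, 16) % (1 << bits)

class FingerprintMap:
    """Maps fingerprints (not full n-grams) to values, trading a small
    false-positive probability for not storing the keys themselves."""
    def __init__(self):
        self.table = {}

    def put(self, ngram, value):
        self.table[fingerprint(ngram)] = value

    def get(self, ngram):
        # Correct for stored keys; may return a colliding entry for unseen keys.
        return self.table.get(fingerprint(ngram))
```

The actual scheme goes further, XOR-combining array cells so that even the fingerprint-to-value table needs no explicit keys, but the correctness/false-positive trade-off is the same.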
Scaling Language Models | These are typically n-gram models that approximate the probability of a word sequence by assuming each token to be independent of all but n − 1 preceding tokens.
Scaling Language Models | Although n-grams observed in natural language corpora are not randomly distributed within this universe, no lossless data structure that we are aware of can circumvent this space dependency on both the n-gram order and the vocabulary size.
Scaling Language Models | While the approach results in significant space savings, working with corpus statistics, rather than n-gram probabilities directly, is computationally less efficient (particularly in a distributed setting) and introduces a dependency on the smoothing scheme used. |
Abstract | We attempt to extract this information from history-contexts of up to ten words in size, and find that it complements the n-gram model well, which inherently suffers from data scarcity in learning long history-contexts.
Introduction | The commonly used n-gram model (Bahl et al.
Introduction | Although n-gram models are simple and effective, modeling long history-contexts leads to severe data scarcity problems.
Language Modeling with TD and TO | The prior, which is usually implemented as a unigram model, can also be replaced with a higher-order n-gram model as, for instance, the bigram model:
Language Modeling with TD and TO | Replacing the unigram model with a higher-order n-gram model is important to compensate for the damage incurred by the conditional independence assumption made earlier.
Motivation of the Proposed Approach | In the n-gram model, for example, these two attributes are jointly taken into account in the ordered word-sequence. |
Motivation of the Proposed Approach | Consequently, the n-gram model can only be effectively implemented within a short history-context (e. g. of size of three or four). |
Motivation of the Proposed Approach | However, intermediate distances beyond the n-gram model limits can be very useful and should not be discarded. |
Related Work | 2007) disassembles the n-gram into (n − 1) word pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n − 1.
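The distance-k decomposition can be sketched as follows (illustrative helper name; here k is the offset between the paired words, so k = 1 recovers the ordinary bigram):

```python
from collections import Counter

def distance_k_bigrams(tokens, k):
    """Collect word pairs whose positions differ by exactly k, i.e. pairs
    separated by k-1 intervening words, as used by a distance-k bigram model."""
    return Counter((tokens[i], tokens[i + k]) for i in range(len(tokens) - k))
```

Counting these pairs for each k from 1 to n − 1 yields the (n − 1) pairwise models into which the n-gram is disassembled.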
Error Classification | While we employ seven types of features (see Sections 4.2 and 4.3), only the word n-gram features are subject to feature selection.2 Specifically, we employ |
Error Classification | the top n_i n-gram features as selected according to information gain computed over the training data (see Yang and Pedersen (1997) for details).
Error Classification | Aggregated word n-gram features. |
Evaluation | Our Baseline system, which only uses word n-gram and random indexing features, seems to perform uniformly poorly across both micro and macro F-scores (F and F; see row 1). |
Score Prediction | 6Before tuning the feature selection parameter, we have to sort the list of n-gram features occurring in the training set.
Automatic Evaluation Metrics | number of n-gram matches of the system translation against one or more reference translations. |
Automatic Evaluation Metrics | Generally, more n-gram matches result in a higher BLEU score. |
Automatic Evaluation Metrics | When determining the matches to calculate precision, BLEU uses a modified, or clipped, n-gram precision.
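The clipping rule can be sketched directly from the BLEU definition (a minimal implementation of modified n-gram precision; the helper name is ours): each candidate n-gram is credited at most as many times as its maximum count in any single reference.

```python
from collections import Counter

def clipped_precision(candidate, references, n):
    """BLEU's modified n-gram precision for one candidate against references."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand = ngrams(candidate)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())
```

For the classic degenerate candidate "the the the the" against "the cat is on the mat", the unigram "the" is clipped to its reference count of 2, giving precision 2/4 rather than 4/4.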
Metric Design Considerations | 4.1 Using N-gram Information |
Metric Design Considerations | Lemma and POS match: Representing each n-gram by its sequence of lemma and POS-tag pairs, we first try to perform an exact match in both lemma and POS tag.
Metric Design Considerations | In all our n-gram matching, each n-gram in the system translation can only match at most one n-gram in the reference translation. |
Abstract | Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques.
Introduction | and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context). |
Language Model Implementations | For an n-gram language model, we can apply this implementation with a slight modification: we need n sorted arrays, one for each n-gram order. |
Language Model Implementations | Because our keys are sorted according to their context-encoded representation, we cannot straightforwardly answer queries about an n-gram w without first determining its context encoding. |
Preliminaries | Our goal in this paper is to provide data structures that map n-gram keys to values, i.e. |
Preliminaries | However, because of the sheer number of keys and values needed for n-gram language modeling, generic implementations do not work efficiently “out of the box.” In this section, we will review existing techniques for encoding the keys and values of an n-gram language model, taking care to account for every bit of memory required by each implementation. |
Preliminaries | In the Web1T corpus, the most frequent n-gram occurs about 95 billion times.
Abstract | Each n-gram is a sequence of words, POS tags, or a combination of words and POS tags.
Abstract | To illustrate, consider the following feature set, a bigram and a trigram (each term in the n-gram has either the form word or ^tag):
Abstract | The first n-gram of the set, please ^VB, would match please^RB bring^VB from the text.
Abstract | We compare our error rates to the state-of-the-art and to a strong Google n-gram count baseline. |
Abstract | We attain a maximum error reduction of 69.8% and average error reduction across all test sets of 59.1% compared to the state-of-the-art and a maximum error reduction of 68.4% and average error reduction across all test sets of 41.8% compared to our Google n-gram count baseline. |
Experiments | We keep a table mapping each unique n-gram to the number of times it has been seen in the training data. |
Experiments | 4.2 Google n-gram Baseline |
Experiments | The Google n-gram corpus is a collection of n-gram counts drawn from public webpages with a total of one trillion tokens: around 1 billion each of unique 3-grams, 4-grams, and 5-grams, and around 300,000 unique bigrams.
Results | MAXENT also outperforms the GOOGLE N-GRAM baseline for almost all test corpora and sequence lengths. |
Results | For the Switchboard test corpus token and type accuracies, the GOOGLE N-GRAM baseline is more accurate than MAXENT for sequences of length 2 and overall, but the accuracy of MAXENT is competitive with that of GOOGLE N-GRAM.
Results | MAXENT also attains a maximum error reduction of 68.4% for the WSJ test corpus and an average error reduction of 41.8% when compared to GOOGLE N-GRAM . |
Task A: Polarity Classification | 4.4 N-gram Evaluation and Results |
Task A: Polarity Classification | N-gram features are widely used in a variety of classification tasks, therefore we also use them in our polarity classification task. |
Task A: Polarity Classification | Figure 2 shows a study of the influence of the different information sources and their combination with n-gram features for English. |
Task B: Valence Prediction | The Farsi and Russian regression models are based only on n-gram features, while the English and Spanish regression models have both n-gram and LIWC features. |
Conclusion | Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model. |
Decipherment | Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P (p) probabilities to plaintext hypotheses during decipherment. |
Decipherment | 3We set the interpolation weights for the word and n-gram LM as (0.9, 0.1). |
Decipherment | We train the letter n-gram LM on 50 million words of English text available from the Linguistic Data Consortium. |
Experiments and Results | EM Method using letter n-gram LMs following the approach of Knight et al. |
Experiments and Results | • Letter n-gram versus Word+n-gram LMs: Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in a +75% improvement in decipherment accuracy.
Introduction | (2006) use the Expectation Maximization (EM) algorithm (Dempster et al., 1977) to search for the best probabilistic key using letter n-gram models. |
Introduction | Ravi and Knight (2008) formulate decipherment as an integer programming problem and provide an exact method to solve simple substitution ciphers by using letter n-gram models along with deterministic key constraints. |
Introduction | • Our new method combines information from word dictionaries along with letter n-gram models, providing a robust decipherment model which offsets the disadvantages faced by previous approaches.
Analysis | The mid-words m which rank highly are those where the occurrence of hma as an n-gram is a good indicator that a attaches to h (m of course does not have to actually occur in the sentence). |
Analysis | However, we additionally find other cues, most notably that if the N IN sequence occurs following a capitalized determiner, it tends to indicate a nominal attachment (in the n-gram, the preposition cannot attach leftward to anything else because of the beginning of the sentence).
Analysis | These features essentially say that if two heads w1 and w2 occur in the direct coordination n-gram "w1 and w2", then they are good heads to coordinate (coordination unfortunately looks the same as complementation or modification to a basic dependency model).
Introduction | (2010), which use Web-scale n-gram counts for multi-way noun bracketing decisions, though that work considers only sequences of nouns and uses only affinity-based web features. |
Working with Web n-Grams | Lapata and Keller (2004) uses the number of page hits as the web-count of the queried n-gram (which is problematic according to Kilgarriff (2007)). |
Working with Web n-Grams | Rather than working through a search API (or scraper), we use an offline web corpus, the Google n-gram corpus (Brants and Franz, 2006), which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.
Working with Web n-Grams | Our system requires the counts from a large collection of these n-gram queries (around 4.5 million). |
Background | To learn the meaning of an n-gram w, Chen and Mooney first collect all navigation plans g that co-occur with w. This forms the initial candidate meaning set for w.
Conclusion | In contrast to the previous approach that computed common subgraphs between different contexts in which an n-gram appeared, we instead focus on small, connected subgraphs and introduce an algorithm, SGOLL, that is an order of magnitude faster. |
Online Lexicon Learning Algorithm | Even though they use beam-search to limit the size of the candidate set, if the initial candidate meaning set for an n-gram is large, it can take a long time to take just one pass through the list of all candidates.
Online Lexicon Learning Algorithm | function Update(training example (e_i, g_i)): for each n-gram w that appears in e_i do
Online Lexicon Learning Algorithm | 14: Increase the count of examples, each n-gram w, and each subgraph g; 15: end function
Discussion and Future Work | The TESLA-M metric allows each n-gram to have a weight, which is primarily used to discount function words. |
Experiments | Based on these synonyms, TESLA-CELAB is able to award less trivial n-gram matches.
Experiments | The covered n-gram matching rule is then able to award tricky n-grams.
Motivation | We formulate the n-gram matching process as a real-valued linear programming problem, which can be solved efficiently. |
The Algorithm | The basic n-gram matching problem is shown in Figure 2. |
The Algorithm | We observe that once an n-gram has been matched, all of its sub-n-grams should be considered matched as well. We call this the covered n-gram matching rule.
The Algorithm | However, we cannot simply perform covered n-gram matching as a post processing step. |
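As a rough illustration of the covered n-gram matching rule (using a hypothetical helper name; the actual TESLA-CELAB formulation is a linear program, not this enumeration):

```python
def covered_ngrams(ngram):
    """Once an n-gram is matched, every contiguous sub-n-gram counts as matched too."""
    n = len(ngram)
    return {tuple(ngram[i:j]) for i in range(n) for j in range(i + 1, n + 1)}

covered = covered_ngrams(("a", "b", "c"))
# the match covers ("a",), ("b", "c"), ..., and ("a", "b", "c") itself: 6 sub-n-grams
```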
Abstract | We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models. |
Conclusions | Using word repetitions, we effectively use a broad document context outside of the typical 2-5 N-gram window. |
Introduction | ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream. |
Introduction | N-gram models cannot, however, capture complex linguistic or topical phenomena that occur outside the typical 3-5 word scope of the model. |
Introduction | Confidence scores from an ASR system (which incorporate N-gram probabilities) are optimized in order to produce the most likely sequence of words rather than the accuracy of individual word detections. |
Motivation | We seek a workable definition of broad document context beyond N-gram models that will improve term detection performance on an arbitrary set of queries. |
Motivation | A number of efforts have been made to augment traditional N-gram models with latent topic information (Khudanpur and Wu, 1999; Florian and Yarowsky, 1999; Liu and Liu, 2008; Hsu and Glass, 2006; Naptali et al., 2012) including some of the early work on Probabilistic Latent Semantic Analysis by Hofmann (2001). |
Term and Document Frequency Statistics | In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence. |
Evaluation | We smooth by adding 40 to all counts, equal to the minimum count in the n-gram data. |
Introduction | For example, in our n-gram collection (Section 3.4), “make it in advance” and “make them in advance” occur roughly the same number of times (442 vs. 449), indicating a referential pattern. |
Methodology | We gather pattern fillers from a large collection of n-gram frequencies. |
Methodology | We do the same processing to our n-gram corpus. |
Methodology | Also, other pronouns in the pattern are allowed to match a corresponding pronoun in an n-gram, regardless of differences in inflection and class.
Experiments | For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l,r) in {(1, 1), (2,1), (1,2), (2,2)}; the hidden layer size in {200, 300, 400}; the learning rate in {0.1, 0.01, 0.001}. |
Experiments | 5.3.2 Word and N-gram Representation |
Experiments | By contrast, using n-gram representations improves the performance on both oov and non-oov. |
Learning from Web Text | The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders. |
Learning from Web Text | Let V^(j) represent the j-th visible variable of the WRRBM, which is a vector of length |V|. Then V^(j) = w_k means that the j-th word in the n-gram is w_k.
Neural Network for POS Disambiguation | The input for this module is the word n-gram (w_1, …, w_n).
Neural Network for POS Disambiguation | representations of the input n-gram.
Related Work | While those approaches mainly explore token-level representations (word or character embeddings), using WRRBM is able to utilize both word and n-gram representations. |
Feature functions | We use p_1^n to denote the n-gram substring p_1 … p_n.
Feature functions | The two substrings ā and b̄ are said to be equal if they have the same length and a_i = b_i for 1 ≤ i ≤ n. For a given sub-word unit n-gram u ∈ P^n, we use the shorthand u ∈ p̄ to mean that we can find u in p̄; i.e., there exists an index i such that p_{i:i+n−1} = u. We use |p̄| to denote the length of the sequence p̄.
Feature functions | Similarly to (Zweig et al., 2010), we adapt TF and IDF by treating a sequence of sub-word units as a “document” and n-gram sub-sequences as “words.” In this analogy, we use sub-sequences in surface pronunciations to “search” for baseforms in the dictionary. |
Introduction | Table 1 shows the n-gram overlap proportions in a sentence-aligned data set of 137K sentence pairs from aligning Simple English Wikipedia and English Wikipedia articles (Coster and Kauchak, 2011a). The data highlights two conflicting views: does the benefit of additional data outweigh the problem of the source of the data?
Introduction | n-gram size: 1 / 2 / 3 / 4 / 5; simple in normal: 0.96 / 0.80 / 0.68 / 0.61 / 0.55; normal in simple: 0.87 / 0.68 / 0.58 / 0.51 / 0.46
Why Does Unsimplified Data Help? | trained using a smoothed version of the maximum likelihood estimate for an n-gram.
Why Does Unsimplified Data Help? | count(bc), where count(·) is the number of times the n-gram occurs in the training corpus.
Why Does Unsimplified Data Help? | For interpolated and backoff n-gram models, these counts are smoothed based on the probabilities of lower-order n-gram models, which are in turn calculated based on counts from the corpus.
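The unsmoothed maximum likelihood estimate described above amounts to a ratio of counts; this sketch (toy data, hypothetical function name) shows the trigram case before any interpolation or backoff is applied:

```python
from collections import Counter

def mle_prob(counts_n, counts_nminus1, ngram):
    """Unsmoothed MLE: count of the full n-gram over count of its (n-1)-gram history."""
    num = counts_n.get(ngram, 0)
    den = counts_nminus1.get(ngram[:-1], 0)
    return num / den if den else 0.0

tokens = "a b a b a c".split()
tri = Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))
bi = Counter(tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1))
p = mle_prob(tri, bi, ("a", "b", "a"))  # count(a b a) / count(a b) = 2 / 2
```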
Experiments | The NBoW performs similarly to the non-neural n-gram based classifiers. |
Experiments | Besides the RecNN that uses an external parser to produce structural features for the model, the other models use n-gram based or neural features that do not require external resources or additional annotations. |
Experiments | We see a significant increase in the performance of the DCNN with respect to the non-neural n-gram based classifiers; in the presence of large amounts of training data these classifiers constitute particularly strong baselines. |
Introduction | Convolving the same filter with the n-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence. |
Properties of the Sentence Model | 4.1 Word and n-Gram Order |
Properties of the Sentence Model | For most applications and in order to learn fine-grained feature detectors, it is beneficial for a model to be able to discriminate whether a specific n-gram occurs in the input. |
Properties of the Sentence Model | 2.3, the Max-TDNN is sensitive to word order, but max pooling only picks out a single n-gram feature in each row of the sentence matrix. |
Dependency language model | The standard N-gram based language model predicts the next word based on the N − 1 immediately preceding words.
Dependency language model | However, the traditional N-gram language model cannot capture long-distance word relations.
Dependency language model | The N-gram DLM predicts the next child of a head based on the N − 1 immediately preceding children and the head itself.
Experiments | Then, we studied the effect of adding different N-gram DLMs to MSTl. |
Introduction | The N-gram DLM has the ability to predict the next child based on the N − 1 immediately preceding children and their head (Shen et al., 2008).
Introduction | The DLM-based features can capture the N-gram information of the parent-children structures for the parsing model. |
Machine Translation as a Decipherment Task | For P (e), we use a word n-gram LM trained on monolingual English data. |
Machine Translation as a Decipherment Task | Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as "THANK YOU TALKING ABOUT ?") that are not good English.
Machine Translation as a Decipherment Task | This stems from the fact that n-gram LMs have no global information about what constitutes a valid English segment. |
Word Substitution Decipherment | We model P(e) using a statistical word n-gram English language model (LM). |
Word Substitution Decipherment | Our method holds several other advantages over the EM approach: (1) inference using smart sampling strategies permits efficient training, allowing us to scale to large data/vocabulary sizes, (2) incremental scoring of derivations during sampling allows efficient inference even when we use higher-order n-gram LMs, (3) there are no memory bottlenecks since the full channel model and derivation lattice are never instantiated during training, and (4) prior specification allows us to learn skewed distributions that are useful here, since word substitution ciphers exhibit a 1-to-1 correspondence between plaintext and cipher types.
Word Substitution Decipherment | build an English word n-gram LM, which is used in the decipherment process. |
Experimental Evaluation | N-gram suggests no non-referential instances
Experimental Evaluation | Effectiveness To understand the contribution of the n-gram (NG), ontology (ON), and clustering (CL) based modules, we ran each separately, as well as every possible combination. |
Experimental Evaluation | Of the three individual modules, the n-gram and clustering methods achieve F-measure of around 0.9, while the ontology-based module performs only modestly above baseline. |
Term Ambiguity Detection (TAD) | This module examines n-gram data from a large text collection. |
Term Ambiguity Detection (TAD) | The rationale behind the n-gram module is based on the understanding that terms appearing in non-named entity contexts are likely to be non-referential, and terms that can be non-referential are ambiguous. |
Term Ambiguity Detection (TAD) | Since we wish for the ambiguity detection determination to be fast, we develop our method to make this judgment solely on the n-gram probability, without the need to examine each individual usage context. |
Smoothing on count distributions | On integral counts, this is simple: we generate, for each n-gram type vu′w, an (n − 1)-gram token u′w, for a total of n_1+(•u′w) tokens.
Smoothing on count distributions | Analogously, on count distributions, for each n-gram type vu′w, we generate an (n − 1)-gram token u′w with probability p(c(vu′w) > 0).
Smoothing on count distributions | Using the dynamic program in Section 3.2, computing the distributions for each r is linear in the number of n-gram types, and we only need to compute the distributions up to r = 2 (or r = 4 for modified KN), and store them for r = 0 (or up to r = 2 for modified KN). |
Smoothing on integral counts | Let uw stand for an n-gram, where u stands for the (n − 1) context words and w, the predicted word.
Smoothing on integral counts | where n_1+(u•) = |{w : c(uw) > 0}| is the number of word types observed after context u, and q_u(w) specifies how to distribute the subtracted discounts among unseen n-gram types.
Smoothing on integral counts | The probability of an n-gram token uw using the other tokens as training data is |
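The continuation count n_1+(u•) used in Kneser-Ney style smoothing can be computed directly from the n-gram table; a minimal sketch with made-up counts and a hypothetical function name:

```python
from collections import Counter, defaultdict

def continuation_types(ngram_counts):
    """n_1+(u.): number of distinct word types w observed after each context u."""
    types = defaultdict(set)
    for ngram, count in ngram_counts.items():
        if count > 0:
            types[ngram[:-1]].add(ngram[-1])
    return {u: len(ws) for u, ws in types.items()}

bigrams = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2})
n1plus = continuation_types(bigrams)
# n_1+(a.) == 2: both b and c have been observed after context a
```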
Abstract | Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming. |
Experiments | • ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.
Experiments | Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n-gram factorization, which is also responsible for a significant increase in wall-clock time.
Experiments | We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures. |
Introduction | Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering. A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).
Introduction | In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions. |
Multi-Structure Sentence Compression | Let α(y) ∈ {0, 1}^n denote the incidence vector of tokens contained in the n-gram sequence y and β(z) ∈ {0, 1}^n denote the incidence vector of words contained in the dependency tree z.
Class-Based Language Modeling | Generalizing this leads to arbitrary order class-based n-gram models of the form: |
Introduction | In the case of n-gram language models this is done by factoring the probability: |
Introduction | do not differ in the last n − 1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i−n+1}^{i−1}).
Introduction | However, in the area of statistical machine translation, especially in the context of large training corpora, fewer experiments with class-based n-gram models have been performed with mixed success (Raab, 2006). |
Abstract | To improve recall, irregularities in parse trees caused by verb form errors are taken into account; to improve precision, n-gram counts are utilized to filter proposed corrections. |
Experiments | For those categories with a high rate of false positives (all except BASEmd, BASEdO and FINITE), we utilized n-grams as filters, allowing a correction only when its n-gram count in the WEB 1T 5-GRAM |
Experiments | Some kind of confidence measure on the n-gram counts might be appropriate for reducing such false alarms. |
Introduction | To improve recall, irregularities in parse trees caused by verb form errors are considered; to improve precision, n-gram counts are utilized to filter proposed corrections. |
Research Issues | We propose using n-gram counts as a filter to counter this kind of overgeneralization. |
Research Issues | A second goal is to show that n-gram counts can effectively serve as a filter in order to increase precision.
Baseline MT | The LM used for decoding is a log-linear combination of four word n-gram LMs which are built on different English corpora.
Name-aware MT Evaluation | where w_n is a set of positive weights summing to one, usually set uniformly as w_n = 1/N; c is the length of the system translation and r is the length of the reference translation; and p_n is the modified n-gram precision, defined as: p_n = Σ_C Σ_{n-gram∈C} Count_clip(n-gram) / Σ_C Σ_{n-gram∈C} Count(n-gram)
Name-aware MT Evaluation | As in BLEU metric, we first count the maximum number of times an n-gram occurs in any single reference translation. |
Name-aware MT Evaluation | The weight of an n-gram in reference translation is the sum of weights of all tokens it contains. |
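The clipping step can be sketched as follows: candidate n-gram counts are capped at the maximum count observed in any single reference before computing the precision (the function name is illustrative):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """BLEU-style modified precision with reference-clipped n-gram counts."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

# "the the the" vs. reference "the cat": "the" is clipped to 1, giving 1/3
p1 = modified_precision(["the", "the", "the"], [["the", "cat"]], 1)
```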
How Frequent is Unseen Data? | The third line across the bottom of the figure is the number of unseen pairs using Google n-gram data as proxy argument counts. |
How Frequent is Unseen Data? | Creating argument counts from n-gram counts is described in detail below in section 5.2. |
Models | Using the Google n-gram corpus, we recorded all verb-noun co-occurrences, defined by appearing in any order in the same n-gram, up to and including 5-grams.
Models | For instance, the test pair (throwsubject, ball) is considered seen if there exists an n-gram such that throw and ball are both included. |
Models | C(v_d, n) as the number of times v and n (ignoring d) appear in the same n-gram.
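Under this definition, a pair counts as seen whenever both words fall inside one n-gram; a toy sketch with invented counts and a hypothetical function name (the real model reads the Google n-gram corpus):

```python
def cooccurrence_count(ngram_counts, verb, noun):
    """Total count of n-grams containing both the verb and the noun, in any order."""
    return sum(count for ngram, count in ngram_counts.items()
               if verb in ngram and noun in ngram)

toy_counts = {("throw", "the", "ball"): 10,
              ("ball", "was", "thrown"): 5,   # "thrown" != "throw", so not counted
              ("throw", "a", "party"): 7}
seen = cooccurrence_count(toy_counts, "throw", "ball") > 0
```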
Results | The Google n-gram backoff model is almost as good as backing off to the Erk smoothing model. |
Background | makes use of n-gram language models over words represented as vectors of factors, including surface form, part of speech, supertag and semantic class. |
Background | In the anytime mode, a best-first search is performed with a configurable time limit: the scores assigned by the n-gram model determine the order of the edges on the agenda, and thus have an impact on realization speed.
Background | the one that covers the most elementary predications in the input logical form, with ties broken according to the n-gram score. |
The Approach | Table 1: Percentage of complete realizations using an oracle n-gram model versus the best performing factored language model. |
Feature Construction | Although the Omni-word feature can be seen as a subset of the n-gram feature, it is not the same: n-gram features are more fragmented.
Experiments and Results | For our character n-gram experiments, we obtained LOWBOW representations for character 3-grams (only n-grams of size n = 3 were used) considering the 2,500 most common n-grams.
Experiments and Results | Also, n-gram information is more dense in documents than word-level information. |
Related Work | (2003) propose the use of language models at the n-gram character-level for AA, whereas Keselj et al. |
Related Work | on characters at the n-gram level (Plakias and Stamatatos, 2008a). |
Related Work | Acceptable performance in AA has been reported with character n-gram representations. |
Data and Approach Overview | [Figure: entity list prior and web n-gram context prior, derived from web query logs and domain-specific entities, serve as priors on the dialog act parameters]
Experiments | * Base-MCM: Our first version injects an informative prior for domain, dialog act, and slot topic distributions using information extracted only from labeled training utterances, injected as prior constraints (the corpus n-gram base measure) during topic assignments.
MultiLayer Context Model - MCM | * Web n-Gram Context Base Measure: As explained in §3, we use the web n-grams as additional information for calculating the base measures of the Dirichlet topic distributions.
MultiLayer Context Model - MCM | * Corpus n-Gram Base Measure: Similar to the other measures, MCM also encodes n-gram constraints as word-frequency features extracted from labeled utterances.
New Sense Indicators | N-gram Probability Features The goal of the Type:NgramProb feature is to capture the fact that “unusual contexts” might imply new senses. |
New Sense Indicators | To capture this, we can look at the log probability of the word under consideration given its N-gram context, both according to an old-domain language model (call this ℓ^old) and a new-domain language model (ℓ^new).
New Sense Indicators | From these four values, we compute corpus-level (and therefore type-based) statistics of the new-domain n-gram log probability (ℓ_ngram^new), the difference between the n-gram probabilities in each domain (ℓ_ngram^new − ℓ_ngram^old), the difference between the n-gram and unigram probabilities in the new domain (ℓ_ngram^new − ℓ_unigram^new), and finally the combined difference: ℓ_ngram^new − ℓ_unigram^new + ℓ_unigram^old − ℓ_ngram^old.
Future Work | This powerful method, which was proposed in (Kam and Kopec, 1996; Popat et al., 2001) in the context of a finite-state model (but not of TSP), can be easily extended to N-gram situations, and typically converges in a small number of iterations. |
Introduction | Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence. |
Phrase-based Decoding as TSP | 4.1 From Bigram to N-gram LM |
Phrase-based Decoding as TSP | If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gram |
Phrase-based Decoding as TSP | The problem becomes even worse if we extend the compiling-out method to n-gram language models with n > 3. |
Distributed representations | For each training update, we read an n-gram x = (w_1, …, w_n).
Distributed representations | We also create a corrupted or noise n-gram x̃.
Distributed representations | • We corrupt the last word of each n-gram.
Abstract | As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models. |
Introduction | In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right. |
Introduction | Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007). |
Introduction | 3. efficient integration of n-gram language model: as translation grows left-to-right in our algorithm, integrating n-gram language models is straightforward. |
Automated Approaches to Deceptive Opinion Spam Detection | In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features. |
Automated Approaches to Deceptive Opinion Spam Detection | Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set. |
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996). |
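The subsumption convention (BIGRAMS+ includes all unigrams, TRIGRAMS+ additionally includes all bigrams) can be sketched like this; the function name and toy sentence are illustrative:

```python
from collections import Counter

def cumulative_ngram_sets(tokens, max_n):
    """Feature sets where each order subsumes all lower orders (the '+' convention)."""
    feats = Counter()
    sets = {}
    for n in range(1, max_n + 1):
        feats += Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        sets[n] = Counter(feats)  # snapshot: order n plus everything below it
    return sets

sets = cumulative_ngram_sets("a b c".split(), 3)
# sets[2] (BIGRAMS+) holds 3 unigrams and 2 bigrams, i.e. 5 feature types
```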
Introduction | Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task. |
Results and Discussion | Surprisingly, models trained only on UNIGRAMS (the simplest n-gram feature set) outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07).
Conclusion | A novel technique was also proposed to rank n-gram phrases where relevance based ranking was used in conjunction with a semi-supervised generative model. |
Empirical Evaluation | The reduced dataset consists of 1,095,586 tokens (after n-gram preprocessing in §4) and 40,102 posts, with an average of 27 posts or interactions per pair.
Model | Like most generative models for text, a post (document) is viewed as a bag of n-grams and each n-gram (word/phrase) takes one value from a predefined vocabulary. |
Phrase Ranking based on Relevance | While this is reasonable, a significant n-gram with high likelihood score may not necessarily be relevant to the problem domain. |
Phrase Ranking based on Relevance | There is nothing wrong with this per se, because the statistical tests only judge the significance of an n-gram, but a significant n-gram may not necessarily be relevant in a given problem domain.
Abstract | The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. |
Paraphrase Evaluation Metrics | Thus, no n-gram overlaps are required to determine the semantic adequacy of the paraphrase candidates. |
Paraphrase Evaluation Metrics | In essence, it is the inverse of BLEU since we want to minimize the number of n-gram overlaps between the two sentences. |
Paraphrase Evaluation Metrics | where N is the maximum n-gram considered and n- |
A Syntax Free Sequence-oriented Sentence Compression Method | Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence. |
A Syntax Free Sequence-oriented Sentence Compression Method | The n-gram distribution of short sentences may differ from that of long sentences.
A Syntax Free Sequence-oriented Sentence Compression Method | Therefore, the n-gram probability sometimes disagrees with our intuition in terms of sentence compression. |
Experimental Evaluation | We developed the n-gram language model from a 9 year set of Mainichi Newspaper articles. |
Results and Discussion | This result shows that the n-gram language model is improper for sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences. |
Experiments | In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in the translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature.
Introduction | In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation. |
Our Method | For target features, the n-gram language model score is defined as the sum of log n-gram probabilities within the target window after the measure word.
Our Method | Target features: n-gram language model score, MW-HW collocation, surrounding words, punctuation position. Source features: MW-HW collocation, surrounding words, source head word, POS tags.
Experiments | RESPONSE The n-gram and emotion features induced from the response. |
Experiments | The n-gram and emotion features induced from the response and the addressee’s utterance. |
Predicting Addressee’s Emotion | We extract all the n-grams (n ≤ 3) in the response to induce (binary) n-gram features.
Predicting Addressee’s Emotion | The extracted n-grams activate another set of binary n-gram features. |
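Binary presence features over all n-grams up to length 3 can be sketched as a set of tuples (the function name is hypothetical):

```python
def binary_ngram_features(tokens, max_n=3):
    """All n-grams with n <= max_n as presence (on/off) features."""
    return {tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

feats = binary_ngram_features("i am happy".split())
# 3 unigrams + 2 bigrams + 1 trigram = 6 binary features
```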
Decipherment Model for Machine Translation | For P(e), we use a word n-gram language model (LM) trained on monolingual target text. |
Decipherment Model for Machine Translation | Generate a target (e.g., English) string e = e_1 … e_n, with probability P(e) according to an n-gram language model.
Feature-based representation for Source and Target | For instance, context features for word w may include other words (or phrases) that appear in the immediate context (n-gram window) surrounding w in the monolingual corpus.
Feature-based representation for Source and Target | The feature construction process is described in more detail below. Target Language: We represent each word (or phrase) e_i with the following contextual features along with their counts: (a) f_−context: every (word n-gram, position) pair immediately preceding e_i in the monolingual corpus (n=1, position=−1), (b) similar features f_+context to model the context following e_i, and (c) we also throw in generic context features f_context without position information: every word that co-occurs with e_i in the same sentence.
Experiments and Results | The performance of WTMF on CDR is compared with (a) an Information Retrieval model (IR) that is based on surface word matching, (b) an n-gram model (N-gram) that captures phrase overlaps by returning the number of overlapping n-grams as the similarity score of two sentences, (c) LSA that uses the svds() function in Matlab, and (d) LDA that uses Gibbs sampling for inference (Griffiths and Steyvers, 2004).
Experiments and Results | The similarity of two sentences is computed by cosine similarity (except for N-gram).
Experiments and Results | We mainly compare the performance of the IR, N-gram, LSA, LDA, and WTMF models.
Architecture of BRAINSUP | N-gram likelihood. |
Architecture of BRAINSUP | This is simply the likelihood of a sentence estimated by an n-gram language model, to enforce the generation of well-formed word sequences. |
Architecture of BRAINSUP | When a solution is not complete, in the computation we include only the sequences of contiguous words (i.e., not interrupted by empty slots) having length greater than or equal to the order of the n-gram model. |
Evaluation | The four combinations of features are: base: Target-word scorer + N-gram likelihood + Dependency likelihood + Variety scorer + Unusual-words scorer + Semantic cohesion; base+D: all the scorers in base + Domain relatedness; base+D+C: all the scorers in base+D + Chromatic connotation; base+D+E: all the scorers in base+D + Emotional connotation; base+D+P: all the scorers in base+D + Phonetic features. |
Experimental Evaluation | BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
Experimental Evaluation | To counteract BLEU’s brittleness at the sentence level, we also smooth BLEU-n and n-gram precision as in Lin and Och (2004). |
Experimental Evaluation | NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
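The sentence-level smoothing of n-gram precision mentioned above can be illustrated with a simple add-one variant in the spirit of Lin and Och (2004); the exact smoothing formula used in the paper may differ:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def smoothed_precision(hyp, ref, n):
    """Clipped n-gram precision; add-one smoothing for n > 1 keeps a
    single missing higher-order match from zeroing the sentence score."""
    h, r = ngram_counts(hyp.split(), n), ngram_counts(ref.split(), n)
    matches = sum(min(c, r[g]) for g, c in h.items())
    total = sum(h.values())
    if n == 1:
        return matches / total if total else 0.0
    return (matches + 1) / (total + 1)
```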
Introduction | BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations. |
Abstract | In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores). |
Discussion and Conclusions | While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments | n-gram frequencies from Gigaword and whether the link parser can fully parse the sentence. |
System Description | 3.2.2 n-gram Count and Language Model Features |
Abstract | As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent. |
Abstract | N-gram based metrics assume that “good” translations tend to share the same lexical choices as the reference translations.
Abstract | As MT systems improve, the shortcomings of the n-gram based evaluation metrics are becoming more apparent. |
Experiments | The SEG-I method requires access to a large web n-gram corpus (Brants and Franz, 2006).
Experiments | where S_Q is the set of all possible query segmentations, S is a possible segmentation, s is a segment in S, and count(s) is the frequency of s in the web n-gram corpus.
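A minimal sketch of this count-based segmentation scoring, with hypothetical counts standing in for the web n-gram corpus; real scorers typically normalize or weight counts rather than multiplying them raw:

```python
from functools import reduce

# Hypothetical counts standing in for a web n-gram corpus.
COUNTS = {"new york": 20_000_000, "times square": 5_000_000,
          "york times": 1_000_000}

def segmentations(words):
    """All ways to split a word list into contiguous segments."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest

def best_segmentation(query):
    """Pick the segmentation whose segments' corpus counts have the
    largest product (unseen segments get count 0, i.e. score 0)."""
    def score(seg):
        return reduce(lambda acc, s: acc * COUNTS.get(s, 0), seg, 1)
    return max(segmentations(query.split()), key=score)
```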
Experiments | (2009), and include, among others, n-gram frequencies in a sample of a query log, web corpus and Wikipedia titles. |
Independent Query Annotations | (2010), an estimate of p(Ci|r) is a smoothed estimator that combines the information from the retrieved sentence r with the information about unigrams (for capitalization and POS tagging) and bigrams (for segmentation) from a large n-gram corpus (Brants and Franz, 2006).
Model | trailing word in the n-gram ). |
Model | Intuitively, an update to the parameter associated with wi occurs after the learner observes word wi in a context (this may be an n-gram, an entire sentence or paragraph containing wi, but we will restrict our attention to fixed-length n-grams).
Model | For this task, we focus only on the following context features for predicting the “predictability” of words: n-gram probability, vector-space similarity score, coreferring mentions. |
Abstract | On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model. |
Experimental Evaluation | being approximately 15 to 20 times faster than their n-gram based approach. |
Experimental Evaluation | To summarize: Our method is significantly faster than n-gram LM based approaches and obtains better results than any previously published method. |
Translation Model | Stochastically generate the target sentence according to an n-gram language model. |
Decoder Integration | One exception is the n-gram language model which requires the preceding n − 1 words as well.
Decoder Integration | To solve this problem, we follow previous work on lattice rescoring with recurrent networks that maintained the usual n-gram context but kept a beam of hidden layer configurations at each state (Auli et al., 2013). |
Decoder Integration | This approximation has been effective for lattice rescoring, since the translations represented by each state are in fact very similar: They share both the same source words as well as the same n-gram context which is likely to result in similar recurrent histories that can be safely pruned. |
Introduction | Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models. |
BLEU and PORT | First, define n-gram precision p(n) and recall r(n): |
BLEU and PORT | where Pg(N) is the geometric average of n-gram precisions
BLEU and PORT | The average precision and average recall used in PORT (unlike those used in BLEU) are the arithmetic average of n-gram precisions Pa(N) and recalls Ra(N): |
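The contrast between BLEU's geometric averaging and PORT's arithmetic averaging of n-gram precisions can be made concrete (illustrative precision values only):

```python
import math

def geometric_avg(precisions):
    """BLEU-style geometric average; conventionally 0 if any precision is 0."""
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

def arithmetic_avg(precisions):
    """PORT-style arithmetic average: robust to a single zero precision."""
    return sum(precisions) / len(precisions)

ps = [0.8, 0.4, 0.2, 0.1]  # hypothetical p(1)..p(4)
```

The arithmetic average stays positive even when a higher-order precision is zero, one reason sentence-level metrics avoid pure geometric averaging.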
Experiments | As usual, French-English is the outlier: the two outputs here are typically so similar that BLEU and Qmean tuning yield very similar n-gram statistics. |
Related Work | Smoothing in NLP usually refers to the problem of smoothing n-gram models. |
Related Work | Sophisticated smoothing techniques like modified Kneser-Ney and Katz smoothing (Chen and Goodman, 1996) smooth together the predictions of unigram, bi-gram, trigram, and potentially higher n-gram sequences to obtain accurate probability estimates in the face of data sparsity. |
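The interpolation idea behind such smoothing can be sketched with a fixed-weight bigram/unigram mixture; note this is far simpler than modified Kneser-Ney, which uses count-dependent discounts and adjusted lower-order distributions:

```python
from collections import Counter

def train_interpolated_lm(corpus, lam=0.7):
    """Toy interpolated model:
    p(w | prev) = lam * p_MLE(w | prev) + (1 - lam) * p_MLE(w)."""
    tokens = corpus.split()
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def p(w, prev):
        p_uni = uni[w] / total
        p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni
    return p

p = train_interpolated_lm("the cat sat the cat ran")
```

Even for an unseen bigram the mixture backs off to the unigram estimate, so no probability is exactly zero for observed words.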
Related Work | While n-gram models have traditionally dominated in language modeling, two recent efforts de- |
Features and Training | Here simple n-gram similarity is used for the sake of efficiency. |
Features and Training | Like GC , there are four features with respect to the value of n in n-gram similarity measure. |
Graph Construction | Solid lines are edges connecting nodes with sufficient source side n-gram similarity, such as the one between "E A M N" and "E A B C". |
Introduction | Collaborative decoding (Li et al., 2009) scores the translation of a source span by its n-gram similarity to the translations by other systems. |
Methods | In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n − 1) tokens.
Methods | Programmatic composition of a lattice with an n-gram LM acceptor is a well understood problem. |
Methods | With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward. |
Experiments | Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text. |
Features | 3.2 Semantic Features 3.2.1 Web n-gram Features Patterns and counts: Hypernymy for a term pair |
Features | Hence, we fire Web n-gram pattern features and Wikipedia presence, distance, and pattern features, similar to those described above, on each potential sibling term pair.7 The main difference here from the edge factors is that the sibling factors are symmetric (in the sense that S_{j,k} is redundant to S_{k,j}) and hence the patterns are undirected.
Hard cases and annotation errors | the longest sequence of words (n-gram) in a corpus that has been observed with a token being tagged differently in another occurrence of the same n-gram in the same corpus.
Hard cases and annotation errors | For each variation n-gram that we found in WSJ-00, i.e., a word in various contexts and the possible tags associated with it, we present annotators with the cross product of contexts and tags.
Hard cases and annotation errors | The figure shows, for instance, that the variation n-gram regarding ADP-ADV is the second most frequent one (dark gray), and approximately 70% of ADP-ADV disagreements are linguistically hard cases (light gray). |
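A minimal sketch of extracting variation n-grams from a tagged corpus; the toy corpus and tags below are invented, and the real method additionally records which token position varies:

```python
from collections import defaultdict

def variation_ngrams(tagged, n):
    """Word n-grams observed with different tag assignments across
    occurrences: candidate annotation errors or genuinely hard cases."""
    seen = defaultdict(set)  # word n-gram -> set of tag n-grams
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    for i in range(len(words) - n + 1):
        seen[tuple(words[i:i + n])].add(tuple(tags[i:i + n]))
    return {gram for gram, variants in seen.items() if len(variants) > 1}

corpus = [("I", "PRON"), ("saw", "VERB"), ("her", "PRON"),
          ("duck", "NOUN"), (".", "PUNCT"),
          ("I", "PRON"), ("saw", "VERB"), ("her", "PRON"),
          ("duck", "VERB"), (".", "PUNCT")]
```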
A Skeleton-based Approach to MT 2.1 Skeleton Identification | For language modeling, lm is the standard n-gram language model adopted in the baseline system. |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | In such a way of string representation, the skeletal language model can be implemented as a standard n-gram language model, that is, a string probability is calculated by a product of a sequence of n-gram probabilities (involving normal words and X). |
A Skeleton-based Approach to MT 2.1 Skeleton Identification | The skeletal language model is then trained on these generalized strings in a standard way of n-gram language modeling. |
Models | The N-gram approximation of the joint probability can be defined in terms of multigrams qi as: |
Models | N-gram models of order > 1 did not work well because these models tended to learn noise (information from non-transliteration pairs) in the training data. |
Previous Research | (2010) submitted another system based on a standard n-gram kernel which ranked first for the English/Hindi and English/Tamil tasks.6 For the English/Arabic task, the transliteration mining system of Noeman and Madkour (2010) was best. |
Discriminative Reranking for OCR | Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, …
Discriminative Reranking for OCR | Semantic coherence feature (“SemCoh”) is motivated by the fact that semantic information can be very useful in modeling the fluency of phrases, and can augment the information provided by n-gram LMs. |
Introduction | The BBN Byblos OCR system (Natajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output. |
Evaluation | Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram , and primarily leverages higher quality generated candidates from the baseline system. |
Evaluation | Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
Generation & Propagation | A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
Submodularity in Summarization | By definition (Lin, 2004), ROUGE-N is the n-gram recall between a candidate summary and a set of reference summaries. |
Submodularity in Summarization | Precisely, let S be the candidate summary (a set of sentences extracted from the ground set V), c_e : 2^V → Z+ be the number of times n-gram e occurs in summary S, and R_i be the set of n-grams contained in the reference summary i (suppose we have K reference summaries, i.e., i = 1, ..., K).
Submodularity in Summarization | where r_{e,i} is the number of times n-gram e occurs in reference summary i.
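ROUGE-N recall as defined above (clipped n-gram matches summed over the K reference summaries, divided by the total number of reference n-grams) can be sketched as:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """ROUGE-N recall: clipped n-gram matches over total reference
    n-grams, accumulated across all reference summaries."""
    cand = ngram_counts(candidate.split(), n)
    matches = total = 0
    for ref in references:
        rc = ngram_counts(ref.split(), n)
        matches += sum(min(cand[g], c) for g, c in rc.items())
        total += sum(rc.values())
    return matches / total if total else 0.0
```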
Experiments | The NIST metric clearly shows a significant improvement, because it mostly measures difficult n-gram matches (e.g., due to the long-distance rules we have been dealing with).
Experiments | In n-gram based metrics, the scores for all words are equally weighted, so mistakes on crucial sentence constituents may be penalized the same as errors on redundant or meaningless words (Callison-Burch et al., 2006). |
Introduction | Thus, with respect to these methods, there is a problem when agreement needs to be applied on part of a sentence whose length exceeds the order of the target n-gram language model and the size of the chunks that are translated (see Figure 1 for an example).
Related Work | Chen (2003) uses an n-gram model and Viterbi decoder as a syllabifier, and then applies it as a preprocessing step in his maximum-entropy-based English L2P system. |
Syllabification with Structured SVMs | In addition to these primary n-gram features, we experimented with linguistically-derived features. |
Syllabification with Structured SVMs | We believe that this is caused by the ability of the SVM to learn such generalizations from the n-gram features alone. |
Features | Trying to find phrase translations for any possible n-gram is not a good idea for two reasons. |
Features | We will define a confidence metric to estimate how reliably the model can align an n-gram on one side to a phrase on the other side given a parallel sentence.
Features | Now we turn to monolingual resources to evaluate the quality of an n-gram being a good phrase. |
The Model | In this model the distribution of the overall sentiment rating y_ov is based on all the n-gram features of a review text.
The Model | Then the distribution of y_a, for every rated aspect a, can be computed from the distribution of y_ov and from any n-gram feature where at least one word in the n-gram is assigned to the associated aspect topic (r = loc, z = a).
The Model | b_y is the bias term which regulates the prior distribution P(y_a = y), f iterates through all the n-grams, J_{f,y} and J^a_{f,y} are common weights and aspect-specific weights for n-gram feature f. p^a_f is equal to the fraction of words in n-gram feature f assigned to the aspect topic (r = loc, z = a).
Introduction | Although paraphrase identification is defined in semantic terms, it is usually solved using statistical classifiers based on shallow lexical, n-gram , and syntactic “overlap” features. |
Introduction | We use a product of experts (Hinton, 2002) to bring together a logistic regression classifier built from n-gram overlap features and our syntactic model. |
Product of Experts | The features are of the form precision_n (number of n-gram matches divided by the number of n-grams in S1), recall_n (number of n-gram matches divided by the number of n-grams in S2) and F_n (harmonic mean of the previous two features), where 1 ≤ n ≤ 3.
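A sketch of these overlap features; counting distinct matches here is a simplification, since the original features may handle duplicate n-grams differently:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_features(s1, s2, max_n=3):
    """precision_n, recall_n, and F_n for n = 1..max_n over two sentences."""
    feats = {}
    t1, t2 = s1.split(), s2.split()
    for n in range(1, max_n + 1):
        g1, g2 = ngrams(t1, n), ngrams(t2, n)
        matches = len(set(g1) & set(g2))
        p = matches / len(g1) if g1 else 0.0
        r = matches / len(g2) if g2 else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        feats.update({f"precision{n}": p, f"recall{n}": r, f"F{n}": f})
    return feats

feats = overlap_features("the cat sat", "the cat ran")
```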
Overall architecture | The unseen rate of n-grams varies according to the simulated user.
Overall architecture | Notice that simulated users C, E and H generate higher rates of unseen n-gram patterns over all word error settings.
Related work | N-gram based approaches (Eckert et al., 1997, Levin et al., 2000) and other approaches (Scheffler and Young, 2001, Pietquin and Dutoit, 2006, Schatzmann et al., 2007) are introduced. |
Translation Selection | We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions. |
Translation Selection | 1-4 n-gram precisions against pseudo references (1 ≤ n ≤ 4)
Translation Selection | 15-19 n-gram precisions against a target corpus (1 ≤ n ≤ 5)
Introduction | Aside from including dependency and n-gram relations in the scoring, we also apply and evaluate SemPOS for English. |
Problems of BLEU | Table 1 estimates the overall magnitude of this issue: For 1-grams to 4-grams in 1640 instances (different MT outputs and different annotators) of 200 sentences with manually flagged errors4, we count how often the n-gram is confirmed by the reference and how often it contains an error flag. |
Problems of BLEU | Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams. |
Automated Classification | Web 1T N-gram Features |
Automated Classification | Table 3 describes the extracted n-gram features. |
Automated Classification | The influence of the Web 1T n-gram features was somewhat mixed.
MBR-based Answering Re-ranking | 0 answer-level n-gram correlation feature: |
MBR-based Answering Re-ranking | where w denotes an n-gram in A, and #w(A_k) denotes the number of times that w occurs in A_k.
MBR-based Answering Re-ranking | o passage-level n-gram correlation feature: |
Conclusion | In our approach, n-gram sub-sequences of transitions per term in the discourse role matrix then constitute the more fine-grained evidence used in our model to distinguish coherence from incoherence. |
Using Discourse Relations | tion of the n-gram discourse relation transition sequences in gold standard coherent text, and a similar one for incoherent text. |
Using Discourse Relations | In our pilot work where we implemented such a basic model with n-gram features for relation transitions, the performance was very poor. |
Background: Hypergraphs | The second step is to integrate an n-gram language model with this hypergraph. |
Introduction | The decoding problem for a broad range of these systems (e.g., (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008)) corresponds to the intersection of a (weighted) hypergraph with an n-gram language model.1 The hypergraph represents a large set of possible translations, and is created by applying a synchronous grammar to the source language string. |
Introduction | The language model is then used to rescore the translations encoded by the hypergraph. Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
Experiments | Due to lower frequency of higher-order n-grams (as opposed to unigrams), higher-order n-gram language models are more sparse, which increases the probability of missing a particular sentiment marker in a sentence (Table 33). |
Experiments | training sets are required to overcome this higher n-gram sparseness in sentence-level annotation. |
Experiments | results depends on the genre and size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01 but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
Experimental Setup | Standard n-gram features are also used as features.4 The feature model is built as follows: for every lemma in the f-structure, we extract a set of morphological properties (definiteness, person, pronominal status etc. |
Experimental Setup | use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams. |
Related Work | The first widely known data-driven approach to surface realisation, or tactical generation (Langkilde and Knight, 1998), used language-model n-gram statistics on a word lattice of candidate realisations to guide a ranker.
Abstract | To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way. |
Introduction | In order to harness the information on the Web without presupposing a deep understanding of all Web text, we instead turn to a diverse collection of Web n-gram counts (Brants and Franz, 2006) which, in aggregate, contain diffuse and indirect, but often robust, cues to reference. |
Semantics via Web Features | These clusters come from distributional K -Means clustering (with K = 1000) on phrases, using the n-gram context as features. |
Conclusion & Future Work | In addition, we can extend our approach by applying some of the techniques used in other system combination approaches such as consensus decoding, using n-gram features, tuning using forest-based MERT, among other possible extensions. |
Related Work 5.1 Domain Adaptation | In other words, it requires all component models to fully decode each sentence, compute n-gram expectations from each component model and calculate posterior probabilities over translation derivations. |
Related Work 5.1 Domain Adaptation | Finally, main techniques used in this work are orthogonal to our approach such as Minimum Bayes Risk decoding, using n-gram features and tuning using MERT. |
Sentence Completion via Language Modeling | 3.1 Backoff N-gram Language Model |
Sentence Completion via Language Modeling | 3.2 Maximum Entropy Class-Based N-gram Language Model |
Sentence Completion via Latent Semantic Analysis | 4.3 A LSA N-gram Language Model |
Learning Class Attributes | We extract prevalent common nouns for males and females by selecting only those nouns that (a) occur more than 200 times in the dataset, (b) mostly occur with male or female pronouns, and (c) occur as lowercase more often than uppercase in a web-scale N-gram corpus (Lin et al., 2010). |
Learning Class Attributes | We obtain the best of both worlds by matching our precise pattern against a version of the Google N-gram Corpus that includes the part-of-speech tag distributions for every N-gram (Lin et al., 2010). |
Twitter Gender Prediction | We include n-gram features with the original capitalization pattern and separate features with the n-grams lower-cased.