Index of papers in Proc. ACL that mention
  • n-gram
Amigó, Enrique and Giménez, Jesús and Gonzalo, Julio and Verdejo, Felisa
Abstract
However, n-gram based metrics are still today the dominant approach.
Alternatives to Correlation-based Meta-evaluation
Therefore, we have focused on other aspects of metric reliability that have revealed differences between n-gram and linguistic based metrics:
Alternatives to Correlation-based Meta-evaluation
All n-gram based metrics achieve SIP and SIR values between 0.8 and 0.9.
Alternatives to Correlation-based Meta-evaluation
This result suggests that n-gram based metrics are reasonably reliable for this purpose.
Correlation with Human Judgements
Let us first analyze the correlation with human judgements for linguistic vs. n-gram based metrics.
Correlation with Human Judgements
Linguistic metrics are represented by grey plots, and black plots represent metrics based on n-gram overlap.
Correlation with Human Judgements
Therefore, we need additional meta-evaluation criteria in order to clarify the behavior of linguistic metrics as compared to n-gram based metrics.
Introduction
However, the most commonly used metrics are still based on n-gram matching.
Introduction
For that purpose, we compare — using four different test beds — the performance of 16 n-gram based metrics, 48 linguistic metrics and one combined metric from the state of the art.
Previous Work on Machine Translation Meta-Evaluation
All those metrics were based on n-gram overlap.
n-gram is mentioned in 23 sentences in this paper.
Topics mentioned in this paper:
Tan, Ming and Zhou, Wenli and Zheng, Lei and Wang, Shaojun
Composite language model
The n-gram language model is essentially a word predictor: given its entire document history, it predicts the next word w_{k+1} based on the last n-1 words, with probability p(w_{k+1} | w_{k-n+2}, ..., w_k).
Composite language model
The SLM (Chelba and Jelinek, 1998; Chelba and Jelinek, 2000) uses syntactic information beyond the regular n-gram models to capture sentence level long range dependencies.
Composite language model
When combining n-gram , m order SLM and
Experimental results
The m-SLM performs competitively with its counterpart n-gram (n=m+1) on large scale corpus.
Introduction
The Markov chain (n-gram) source models, which predict each word on the basis of previous n-1 words, have been the workhorses of state-of-the-art speech recognizers and machine translators that help to resolve acoustic or foreign language ambiguities by placing higher probability on more likely original underlying word strings.
Introduction
are efficient at encoding local word interactions, the n-gram model clearly ignores the rich syntactic and semantic structures that constrain natural languages.
Introduction
(2006) integrated n-gram , structured language model (SLM) (Chelba and Jelinek, 2000) and probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) under the directed MRF framework (Wang et al., 2005) and studied the stochastic properties for the composite language model.
Training algorithm
Similar to SLM (Chelba and Jelinek, 2000), we adopt an N-best list approximate EM re-estimation with modular modifications to seamlessly incorporate the effect of n-gram and PLSA components.
Training algorithm
This implies that when computing the language model probability of a sentence in a client, all servers need to be contacted for each n-gram request.
Training algorithm
(2007) follows a standard MapReduce paradigm (Dean and Ghemawat, 2004): the corpus is first divided and loaded into a number of clients, and n-gram counts are collected at each client, then the n-gram counts mapped and stored in a number of servers, resulting in exactly one server being contacted per n-gram when computing the language model probability of a sentence.
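The distributed counting scheme sketched in the sentence above maps cleanly onto a few lines of code. The following Python fragment is a minimal illustration of that idea (per-client counting, then hash-partitioning so exactly one server owns each n-gram); the function names, the toy corpus, and the use of Python's built-in hash are assumptions for illustration, not the cited system's implementation.

```python
from collections import Counter

def extract_ngrams(tokens, max_order=3):
    """Yield all n-grams (as tuples) up to max_order from a token list."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def map_counts(corpus_shard, max_order=3):
    """'Map' step: each client counts the n-grams in its shard of the corpus."""
    counts = Counter()
    for sentence in corpus_shard:
        counts.update(extract_ngrams(sentence.split(), max_order))
    return counts

def shard_for(ngram, num_servers):
    """Hash partitioner: exactly one server is responsible for each n-gram."""
    return hash(ngram) % num_servers

def reduce_counts(client_counts, num_servers):
    """'Reduce' step: merge per-client counts onto the responsible servers."""
    servers = [Counter() for _ in range(num_servers)]
    for counts in client_counts:
        for ngram, c in counts.items():
            servers[shard_for(ngram, num_servers)][ngram] += c
    return servers

# Toy usage: two clients, two servers.
shards = [["the cat sat", "the cat ran"], ["a cat sat on the mat"]]
servers = reduce_counts([map_counts(s) for s in shards], num_servers=2)
print(sum(servers, Counter())[("the", "cat")])  # -> 2
```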
n-gram is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Zhou, Guangyou and Zhao, Jun and Liu, Kang and Cai, Li
Introduction
Another is a web-scale N-gram corpus, which is an N-gram corpus with N-grams of length 1-5 (Brants and Franz, 2006); we call it Google V1 in this paper.
Web-Derived Selectional Preference Features
One is the web, where N-gram counts are approximated by Google hits.
Web-Derived Selectional Preference Features
This N-gram corpus records how often each unique sequence of words occurs.
Web-Derived Selectional Preference Features
times or more (1 in 25 billion) are kept, and appear in the n-gram tables.
n-gram is mentioned in 27 sentences in this paper.
Topics mentioned in this paper:
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
Abstract
Our model is also considered as a way to construct an accurate word n-gram language model directly from characters of arbitrary language, without any "word" indications.
Experiments
NPY(n) means n-gram NPYLM.
Inference
character n-gram model explicit as Θ, we can set
Inference
where p(c_1 ... c_k, k | Θ) is an n-gram probability given by (6), and p(k | Θ) is a probability that a word of length k will be generated from Θ.
Introduction
In this paper, we extend this work to propose a more efficient and accurate unsupervised word segmentation that will optimize the performance of the word n-gram Pitman-Yor (i.e.
Introduction
Furthermore, it can be viewed as a method for building a high-performance n-gram language model directly from character strings of arbitrary language.
Introduction
By embedding a character n-gram in word n-gram from a Bayesian perspective, Section 3 introduces a novel language model for word segmentation, which we call the Nested Pitman-Yor language model.
Nested Pitman-Yor Language Model
In contrast, in this paper we use a simple but more elaborate model, that is, a character n-gram language model that also employs HPYLM.
Nested Pitman-Yor Language Model
To avoid dependency on n-gram order n, we actually used the ∞-gram language model (Mochihashi and Sumita, 2007), a variable order HPYLM, for characters.
Pitman-Yor process and n-gram models
In this representation, each n-gram context h (including the null context ε for unigrams) is a Chinese restaurant whose customers are the n-gram counts seated over the tables 1 ... t_hw.
Pitman-Yor process and n-gram models
As a result, the n-gram probability of this hierarchical Pitman-Yor language model (HPYLM) is recursively computed as
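The quoted sentence is cut off before the equation it introduces. For orientation, the standard hierarchical Pitman-Yor predictive probability (following Teh, 2006) has the shape below; the symbols (c for customer counts, t for table counts, d and θ for the discount and strength parameters, h' for the shortened context) are a reconstruction and may differ in detail from the paper's own notation.

```latex
p(w \mid h) \;=\; \frac{c(w \mid h) - d\, t_{hw}}{\theta + c(h)}
\;+\; \frac{\theta + d\, t_{h\cdot}}{\theta + c(h)}\, p(w \mid h'),
\qquad c(h) = \sum_{w} c(w \mid h), \quad t_{h\cdot} = \sum_{w} t_{hw}
```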
n-gram is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Abstract
Our particular variational distributions are parameterized as n-gram models.
Abstract
We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008).
Introduction
In practice, we approximate with several different variational families Q, corresponding to n-gram (Markov) models of different orders.
Variational Approximate Decoding
Since each q(y) is a distribution over output strings, a natural choice for Q is the family of n-gram models.
Variational Approximate Decoding
where W is a set of n-gram types.
Variational Approximate Decoding
Each w ∈ W is an n-gram, which occurs c_w(y) times in the string y, and w may be divided into an (n-1)-gram prefix (the history) and a 1-gram suffix r(w) (the rightmost or current word).
n-gram is mentioned in 32 sentences in this paper.
Topics mentioned in this paper:
Li, Mu and Duan, Nan and Zhang, Dongdong and Li, Chi-Ho and Zhou, Ming
Abstract
Using an iterative decoding approach, n-gram agreement statistics between translations of multiple decoders are employed to re-rank both full and partial hypotheses explored in decoding.
Collaborative Decoding
To compute the consensus measures, we further decompose each of them into n-gram matching statistics between e and e'.
Collaborative Decoding
For each n-gram of order n, we introduce a pair of complementary consensus measure functions G_n+(e, e') and G_n-(e, e') described as follows:
Collaborative Decoding
G_n+(e, e') is the n-gram agreement measure function which counts the number of occurrences in e' of n-grams in e.
Experiments
All baseline decoders are extended with n-gram consensus-based co-decoding features to construct member decoders.
Experiments
Table 4 shows the comparison results of a two-system co-decoding using different settings of n-gram agreement and disagreement features.
Experiments
It is clearly shown that both n-gram agreement and disagreement types of features are helpful, and using them together is the best choice.
n-gram is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Kumar, Shankar and Macherey, Wolfgang and Dyer, Chris and Och, Franz
Introduction
Lattice MBR decoding uses a linear approximation to the BLEU score (Papineni et al., 2001); the weights in this linear loss are set heuristically by assuming that n-gram precisions decay exponentially with n. However, this may not be optimal in practice.
Minimum Bayes-Risk Decoding
They approximated log(BLEU) score by a linear function of n-gram matches and candidate length.
Minimum Bayes-Risk Decoding
where w is an n-gram present in either E or E', and θ_0, θ_1, ..., θ_N are weights which are determined empirically, where N is the maximum n-gram order.
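The quotes around this point refer to the linear approximation of Tromble et al. (2008). For reference, one common way to write that approximation is sketched below, where #_w(E') is the number of times n-gram w occurs in E' and δ_w(E) indicates whether w appears in E; the exact notation is an assumption, not copied from the paper.

```latex
G(E, E') \;=\; \theta_0\, |E'| \;+\; \sum_{w} \theta_{|w|}\, \#_w(E')\, \delta_w(E)
```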
Minimum Bayes-Risk Decoding
Next, the posterior probability of each n-gram is computed.
n-gram is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
Schwartz, Lane and Callison-Burch, Chris and Schuler, William and Wu, Stephen
Introduction
Drawing on earlier successes in speech recognition, research in statistical machine translation has effectively used n-gram word sequence models as language models.
Introduction
Modern phrase-based translation using large scale n-gram language models generally performs well in terms of lexical choice, but still often produces ungrammatical output.
Related Work
(2007) use supertag n-gram LMs.
Related Work
An n-gram language model history is also maintained at each node in the translation lattice.
Related Work
The search space is further trimmed with hypothesis recombination, which collapses lattice nodes that share a common coverage vector and n-gram state.
n-gram is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Aker, Ahmet and Gaizauskas, Robert
Abstract
Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries.
Discussion and Conclusion
Our evaluations show that such an approach yields summaries which score more highly than an approach which uses a simpler representation of an object type model in the form of an n-gram language model.
Introduction
a corpus of descriptions of churches, a corpus of bridge descriptions, and so on) and reported results showing that incorporating such n-gram language models as a feature in a feature-based extractive summarizer improves the quality of automatically generated summaries.
Introduction
The main weakness of n-gram language models is that they only capture very local information about short term sequences and cannot model long distance dependencies between terms.
Introduction
If this information is expressed as in the first line of Table 1, n-gram language models are likely to
Representing conceptual models 2.1 Object type corpora
We derive n-gram language and dependency pattern models using object type corpora made available to us by Aker and Gaizauskas.
Representing conceptual models 2.1 Object type corpora
2.2 N-gram language models
Representing conceptual models 2.1 Object type corpora
they calculate the probability that a sentence is generated based on an n-gram language model.
Summarizer
• LMSim: The similarity of a sentence S to an n-gram language model LM (the probability that the sentence S is generated by LM).
n-gram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Decoding with the NNJM
Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder.
Decoding with the NNJM
When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n — 1 words from each constituent must be stored in the state space.
Decoding with the NNJM
We also train a separate lower-order n-gram model, which is necessary to compute estimate scores during hierarchical decoding.
Introduction
Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010).
Introduction
Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations
Specifically, this means that we don’t use dependency-based rule extraction, and our decoder only contains the following MT features: (1) rule probabilities, (2) n-gram Kneser-Ney LM, (3) lexical smoothing, (4) target word count, (5) concat rule penalty.
Model Variations
This does not include the cost of n-gram creation or cached lookups, which amount to ~0.03 seconds per source word in our current implementation. However, the n-grams created for the NNJM can be shared with the Kneser-Ney LM, which reduces the cost of that feature.
Model Variations
In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates.
Neural Network Joint Model (NNJM)
Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n-1 target words.
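As a hedged illustration of the decomposition described above (an n-gram target history augmented with a source-side window), the joint model factors roughly as below, with S_i standing for the m-word source window affiliated with target word t_i; the precise window construction is defined in the paper and not reproduced here.

```latex
P(T \mid S) \;\approx\; \prod_{i=1}^{|T|} P\big(t_i \,\big|\, t_{i-n+1}, \ldots, t_{i-1},\; S_i\big)
```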
Neural Network Joint Model (NNJM)
If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score. Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for the self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.
Neural Network Joint Model (NNJM)
“lookups/sec” is the number of unique n-gram probabilities that can be computed per second.
n-gram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
He, Wei and Wang, Haifeng and Guo, Yuqing and Liu, Ting
Introduction
One is n-gram model over different units, such as word-level bigram/trigram models (Bangalore and Rambow, 2000; Langkilde, 2000), or factored language models integrated with syntactic tags (White et al.
Introduction
(2008) develop a general-purpose realizer couched in the framework of Lexical Functional Grammar based on simple n-gram models.
Introduction
(2009) present a dependency-spanning tree algorithm for word ordering, which first builds dependency trees to decide linear precedence between heads and modifiers then uses an n-gram language model to order siblings.
Log-linear Models
We linearize the dependency relations by computing n-gram models, similar to traditional word-based language models, except using the names of dependency relations instead of words.
Log-linear Models
The dependency relation model calculates the probability of dependency relation n-gram P(DR) according to Eq.(3).
Log-linear Models
= ∏_{k=1} P(DR_k | DR_{k-n+1}, ..., DR_{k-1}). Word Model: We integrate an n-gram word model into the log-linear model for capturing the relation between adjacent words.
n-gram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Computing Feature Expectations
In this section, we consider BLEU in particular, for which the relevant features φ(e) are n-gram counts up to length n = 4.
Computing Feature Expectations
The nodes are states in the decoding process that include the span (i, j) of the sentence to be translated, the grammar symbol s over that span, and the left and right context words of the translation relevant for computing n-gram language model scores. Each hyper-edge h represents the application of a synchronous rule r that combines nodes corresponding to non-terminals in
Computing Feature Expectations
Each n-gram that appears in a translation e is associated with some h in its derivation: the h corresponding to the rule that produces the n-gram.
Consensus Decoding Algorithms
In this expression, BLEU(e; e') references e' only via its n-gram count features c(e', t); the length penalty is also a function of n-gram counts.
Introduction
The contributions of this paper include a linear-time algorithm for MBR using linear similarities, a linear-time alternative to MBR using nonlinear similarity measures, and a forest-based extension to this procedure for similarities based on n-gram counts.
n-gram is mentioned in 26 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
We propose a simple generative, syntactic language model that conditions on overlapping windows of tree context (or treelets) in the same way that n-gram language models condition on overlapping windows of linear context.
Abstract
We estimate the parameters of our model by collecting counts from automatically parsed text using standard n-gram language model estimation techniques, allowing us to train a model on over one billion tokens of data using a single machine in a matter of hours.
Introduction
At the same time, because n-gram language models only condition on a local window of linear word-level context, they are poor models of long-range syntactic dependencies.
Introduction
Although several lines of work have proposed generative syntactic language models that improve on n-gram models for moderate amounts of data (Chelba, 1997; Xu et al., 2002; Charniak, 2001; Hall, 2004; Roark,
Introduction
Our model can be trained simply by collecting counts and using the same smoothing techniques normally applied to n-gram models (Kneser and Ney, 1995), enabling us to apply techniques developed for scaling n-gram models out of the box (Brants et al., 2007; Pauls and Klein, 2011).
Treelet Language Modeling
The common denominator of most n-gram language models is that they assign probabilities roughly according to empirical frequencies for observed n-grams, but fall back to distributions conditioned on smaller contexts for unobserved n-grams, as shown in Figure 1(a).
Treelet Language Modeling
As in the n-gram case, we would like to pick h to be large enough to capture relevant dependencies, but small enough that we can obtain meaningful estimates from data.
Treelet Language Modeling
Although it is tempting to think that we can replace the left-to-right generation of n-gram models with the purely top-down generation of typical PCFGs, in practice, words are often highly predictive of the words that follow them — indeed, n-gram models would be terrible language models if this were not the case.
n-gram is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Ceylan, Hakan and Kim, Yookyung
Language Identification
We implement a statistical model using a character based n-gram feature.
Language Identification
For each language, we collect the n-gram counts (for n = 1 to n = 7, also using the word beginning and ending spaces) from the vocabulary of the training corpus, and then generate a probability distribution from these counts.
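A minimal Python sketch of the counting step described in that sentence, assuming whitespace-delimited words and boundary spaces marking word beginnings and endings; the function names and the toy vocabulary are illustrative only.

```python
from collections import Counter

def char_ngrams(word, max_n=7):
    """Character n-grams of a word for n = 1..max_n, padded with boundary
    spaces to mark the word beginning and ending (as described above)."""
    padded = f" {word} "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def ngram_distribution(vocabulary, max_n=7):
    """Collect character n-gram counts over a training vocabulary and
    normalize them into a probability distribution."""
    counts = Counter()
    for word in vocabulary:
        counts.update(char_ngrams(word, max_n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# Usage: build one distribution per language, then score test text against each.
english = ngram_distribution(["the", "language", "identification"])
print(sorted(english.items(), key=lambda kv: -kv[1])[:3])
```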
Language Identification
In other words, this time we used a word-based n-gram method, only with n = 1.
Related Work
Most of the work carried out to date on the written language identification problem consists of supervised approaches that are trained on a list of words or n-gram models for each reference language.
Related Work
The n-gram based approaches are based on the counts of character or byte n-grams, which are sequences of n characters or bytes, extracted from a corpus for each reference language.
Related Work
classification models that use the n-gram features have been proposed.
n-gram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model.
Decoding to Maximize BLEU
BLEU is based on n-gram precision, and since each synchronous constituent in the tree adds a new 4-gram to the translation at the point where its children are concatenated, the additional pass approximately maximizes BLEU.
Introduction
This complexity arises from the interaction of the tree-based translation model with an n-gram language model.
Language Model Integrated Decoding for SCFG
We begin by introducing Synchronous Context Free Grammars and their decoding algorithms when an n-gram language model is integrated into the grammatical search space.
Language Model Integrated Decoding for SCFG
Without an n-gram language model, decoding using SCFG is not much different from CFG parsing.
Language Model Integrated Decoding for SCFG
However, when we want to integrate an n-gram language model into the search, our goal is searching for the derivation whose total sum of weights of productions and n-gram log probabilities is maximized.
Multi-pass LM-Integrated Decoding
We take the same view as in speech recognition that a trigram-integrated model is a finer-grained model than a bigram model, and in general we can do an (n-1)-gram decoding as a predictive pass for the following n-gram pass.
Multi-pass LM-Integrated Decoding
We can make the approximation even closer by taking local higher-order outside n-gram information for a state X[i, j, u_{1...n-1}, v_{1...n-1}] into account.
Multi-pass LM-Integrated Decoding
where β is the Viterbi inside cost and α is the Viterbi outside cost, to globally prioritize the n-gram integrated states on the agenda for exploration.
n-gram is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Roark, Brian and Allauzen, Cyril and Riley, Michael
Abstract
We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of well-known Kneser-Ney (1995) smoothing.
Abstract
We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods.
Introduction
Smoothed n-gram language models are the defacto standard statistical models of language for a wide range of natural language applications, including speech recognition and machine translation.
Introduction
Such models are trained on large text corpora, by counting the frequency of n-gram collocations, then normalizing and smoothing (regularizing) the resulting multinomial distributions.
Introduction
Briefly, the smoothing method reestimates lower-order n-gram parameters in order to avoid overestimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams.
Preliminaries
N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h' that are used in the smoothing recursion.
Preliminaries
where c(hw) is the count of the n-gram sequence hw.
Preliminaries
N-gram language models allow for a sparse representation, so that only a subset of the possible n-grams must be explicitly stored.
n-gram is mentioned in 42 sentences in this paper.
Topics mentioned in this paper:
Xiao, Tong and Zhu, Jingbo and Zhu, Muhua and Wang, Huizhen
Background
The computation of ψ(e, H(v)) is based on a linear combination of a set of n-gram consensus-based features.
Background
For each order of n-gram, h_n+(e, H(v)) and h_n-(e, H(v)) are defined to measure the n-gram agreement and disagreement between e and other translation candidates in H(v), respectively.
Background
If p orders of n-gram are used in computing ψ(e, H(v)), the total number of features in the system combination will be T + 2×p (T model-score-based features defined in Equation 8 and 2×p consensus-based features defined in Equation 9).
n-gram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Talbot, David and Brants, Thorsten
Abstract
The scheme can represent any standard n-gram model and is easily combined with existing model reduction techniques such as entropy-pruning.
Introduction
LMs are usually implemented as n-gram models parameterized for each distinct sequence of up to n words observed in the training corpus.
Introduction
Recent work (Talbot and Osborne, 2007b) has demonstrated that randomized encodings can be used to represent n-gram counts for LMs with significant space-savings, circumventing information-theoretic constraints on lossless data structures by allowing errors with some small probability.
Introduction
Lookup is very efficient: the values of 3 cells in a large array are combined with the fingerprint of an n-gram .
Perfect Hash-based Language Models
The model can erroneously return a value for an n-gram that was never actually stored, but will always return the correct value for an n-gram that is in the model.
Perfect Hash-based Language Models
Each n-gram x_i is drawn from some set of possible n-grams U and its associated value from a corresponding set of possible values V.
Perfect Hash-based Language Models
We do not store the n-grams and their probabilities directly but rather encode a fingerprint of each n-gram together with its associated value in such a way that the value can be retrieved when the model is queried with the n-gram x_i.
Scaling Language Models
These are typically n-gram models that approximate the probability of a word sequence by assuming each token to be independent of all but n-1 preceding tokens.
Scaling Language Models
Although n-grams observed in natural language corpora are not randomly distributed within this universe no lossless data structure that we are aware of can circumvent this space-dependency on both the n-gram order and the vocabulary size.
Scaling Language Models
While the approach results in significant space savings, working with corpus statistics, rather than n-gram probabilities directly, is computationally less efficient (particularly in a distributed setting) and introduces a dependency on the smoothing scheme used.
n-gram is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Chong, Tze Yuang and E. Banchs, Rafael and Chng, Eng Siong and Li, Haizhou
Abstract
We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts.
Introduction
The commonly used n-gram model (Bahl et al.
Introduction
Although n-gram models are simple and effective, modeling long history-contexts leads to severe data scarcity problems.
Language Modeling with TD and TO
The prior, which is usually implemented as a unigram model, can also be replaced with a higher order n-gram model as, for instance, the bigram model:
Language Modeling with TD and TO
Replacing the unigram model with a higher order n-gram model is important to compensate the damage incurred by the conditional independence assumption made earlier.
Motivation of the Proposed Approach
In the n-gram model, for example, these two attributes are jointly taken into account in the ordered word-sequence.
Motivation of the Proposed Approach
Consequently, the n-gram model can only be effectively implemented within a short history-context (e.g., of size three or four).
Motivation of the Proposed Approach
However, intermediate distances beyond the n-gram model limits can be very useful and should not be discarded.
Related Work
2007) disassembles the n-gram into (n-1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n-1.
n-gram is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Persing, Isaac and Ng, Vincent
Error Classification
While we employ seven types of features (see Sections 4.2 and 4.3), only the word n-gram features are subject to feature selection. Specifically, we employ
Error Classification
the top n_i n-gram features as selected according to information gain computed over the training data (see Yang and Pedersen (1997) for details).
Error Classification
Aggregated word n-gram features.
Evaluation
Our Baseline system, which only uses word n-gram and random indexing features, seems to perform uniformly poorly across both micro and macro F-scores (F and F; see row 1).
Score Prediction
Before tuning the feature selection parameter, we have to sort the list of n-gram features occurring in the training set.
n-gram is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Automatic Evaluation Metrics
number of n-gram matches of the system translation against one or more reference translations.
Automatic Evaluation Metrics
Generally, more n-gram matches result in a higher BLEU score.
Automatic Evaluation Metrics
When determining the matches to calculate precision, BLEU uses a modified, or clipped n-gram precision.
Metric Design Considerations
4.1 Using N-gram Information
Metric Design Considerations
Lemma and POS match: Representing each n-gram by its sequence of lemma and POS-tag pairs, we first try to perform an exact match in both lemma and POS-tag.
Metric Design Considerations
In all our n-gram matching, each n-gram in the system translation can only match at most one n-gram in the reference translation.
n-gram is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Pauls, Adam and Klein, Dan
Abstract
Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram , the most compact lossless representation to date, and even more compact than recent lossy compression techniques.
Introduction
and Raj, 2001; Hsu and Glass, 2008), our methods are conceptually based on tabular trie encodings wherein each n-gram key is stored as the concatenation of one word (here, the last) and an offset encoding the remaining words (here, the context).
Language Model Implementations
For an n-gram language model, we can apply this implementation with a slight modification: we need n sorted arrays, one for each n-gram order.
Language Model Implementations
Because our keys are sorted according to their context-encoded representation, we cannot straightforwardly answer queries about an n-gram w without first determining its context encoding.
Preliminaries
Our goal in this paper is to provide data structures that map n-gram keys to values, i.e.
Preliminaries
However, because of the sheer number of keys and values needed for n-gram language modeling, generic implementations do not work efficiently “out of the box.” In this section, we will review existing techniques for encoding the keys and values of an n-gram language model, taking care to account for every bit of memory required by each implementation.
Preliminaries
In the Web1T corpus, the most frequent n-gram occurs about 95 billion times.
n-gram is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Bramsen, Philip and Escobar-Molano, Martha and Patel, Ami and Alonso, Rafael
Abstract
Each n-gram is a sequence of words, POS tags or a combination of words and POS tags
Abstract
To illustrate, consider the following feature set, a bigram and a trigram (each term in the n-gram either has the form word or ^tag):
Abstract
The first n-gram of the set, please ^VB, would match please^RB bring^VB from the text.
n-gram is mentioned in 21 sentences in this paper.
Topics mentioned in this paper:
Liu, Jenny and Haghighi, Aria
Abstract
We compare our error rates to the state-of-the-art and to a strong Google n-gram count baseline.
Abstract
We attain a maximum error reduction of 69.8% and average error reduction across all test sets of 59.1% compared to the state-of-the-art and a maximum error reduction of 68.4% and average error reduction across all test sets of 41.8% compared to our Google n-gram count baseline.
Experiments
We keep a table mapping each unique n-gram to the number of times it has been seen in the training data.
Experiments
4.2 Google n-gram Baseline
Experiments
The Google n-gram corpus is a collection of n-gram counts drawn from public webpages with a total of one trillion tokens — around 1 billion each of unique 3-grams, 4-grams, and 5-grams, and around 300,000 unique bigrams.
Results
MAXENT also outperforms the GOOGLE N-GRAM baseline for almost all test corpora and sequence lengths.
Results
For the Switchboard test corpus token and type accuracies, the GOOGLE N-GRAM baseline is more accurate than MAXENT for sequences of length 2 and overall, but the accuracy of MAXENT is competitive with that of GOOGLE N-GRAM .
Results
MAXENT also attains a maximum error reduction of 68.4% for the WSJ test corpus and an average error reduction of 41.8% when compared to GOOGLE N-GRAM .
n-gram is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Kozareva, Zornitsa
Task A: Polarity Classification
4.4 N-gram Evaluation and Results
Task A: Polarity Classification
N-gram features are widely used in a variety of classification tasks, therefore we also use them in our polarity classification task.
Task A: Polarity Classification
Figure 2 shows a study of the influence of the different information sources and their combination with n-gram features for English.
Task B: Valence Prediction
The Farsi and Russian regression models are based only on n-gram features, while the English and Spanish regression models have both n-gram and LIWC features.
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Conclusion
Unlike previous approaches, our method combines information from letter n-gram language models and word dictionaries and provides a robust decipherment model.
Decipherment
Combining letter n-gram language models with word dictionaries: Many existing probabilistic approaches use statistical letter n-gram language models of English to assign P (p) probabilities to plaintext hypotheses during decipherment.
Decipherment
We set the interpolation weights for the word and n-gram LM as (0.9, 0.1).
Decipherment
We train the letter n-gram LM on 50 million words of English text available from the Linguistic Data Consortium.
Experiments and Results
EM Method using letter n-gram LMs following the approach of Knight et al.
Experiments and Results
• Letter n-gram versus Word+n-gram LMs: Figure 2 shows that using a word+3-gram LM instead of a 3-gram LM results in +75% improvement in decipherment accuracy.
Introduction
(2006) use the Expectation Maximization (EM) algorithm (Dempster et al., 1977) to search for the best probabilistic key using letter n-gram models.
Introduction
Ravi and Knight (2008) formulate decipherment as an integer programming problem and provide an exact method to solve simple substitution ciphers by using letter n-gram models along with deterministic key constraints.
Introduction
• Our new method combines information from word dictionaries along with letter n-gram models, providing a robust decipherment model which offsets the disadvantages faced by previous approaches.
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Bansal, Mohit and Klein, Dan
Analysis
The mid-words m which rank highly are those where the occurrence of hma as an n-gram is a good indicator that a attaches to h (m of course does not have to actually occur in the sentence).
Analysis
However, we additionally find other cues, most notably that if the N IN sequence occurs following a capitalized determiner, it tends to indicate a nominal attachment (in the n-gram , the preposition cannot attach leftward to anything else because of the beginning of the sentence).
Analysis
These features essentially say that if two heads w1 and w2 occur in the direct coordination n-gram "w1 and w2", then they are good heads to coordinate (coordination unfortunately looks the same as complementation or modification to a basic dependency model).
Introduction
(2010), which use Web-scale n-gram counts for multi-way noun bracketing decisions, though that work considers only sequences of nouns and uses only affinity-based web features.
Working with Web n-Grams
Lapata and Keller (2004) uses the number of page hits as the web-count of the queried n-gram (which is problematic according to Kilgarriff (2007)).
Working with Web n-Grams
Rather than working through a search API (or scraper), we use an offline web corpus — the Google n-gram corpus (Brants and Franz, 2006) — which contains English n-grams (n = 1 to 5) and their observed frequency counts, generated from nearly 1 trillion word tokens and 95 billion sentences.
Working with Web n-Grams
Our system requires the counts from a large collection of these n-gram queries (around 4.5 million).
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Chen, David
Background
To learn the meaning of an n-gram w, Chen and Mooney first collect all navigation plans g that co-occur with w. This forms the initial candidate meaning set for w.
Conclusion
In contrast to the previous approach that computed common subgraphs between different contexts in which an n-gram appeared, we instead focus on small, connected subgraphs and introduce an algorithm, SGOLL, that is an order of magnitude faster.
Online Lexicon Learning Algorithm
Even though they use beam search to limit the size of the candidate set, if the initial candidate meaning set for an n-gram is large, it can take a long time to make just one pass through the list of all candidates.
Online Lexicon Learning Algorithm
function Update(training example (e_i, p_i)): for each n-gram w that appears in e_i do
Online Lexicon Learning Algorithm
Increase the count of examples, of each n-gram w, and of each subgraph g; end function
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Liu, Chang and Ng, Hwee Tou
Discussion and Future Work
The TESLA-M metric allows each n-gram to have a weight, which is primarily used to discount function words.
Experiments
Based on these synonyms, TESLA-CELAB is able to award less trivial n-gram matches.
Experiments
The covered n-gram matching rule is then able to award tricky n-grams.
Motivation
We formulate the n-gram matching process as a real-valued linear programming problem, which can be solved efficiently.
The Algorithm
The basic n-gram matching problem is shown in Figure 2.
The Algorithm
We observe that once an n-gram has been matched, all its sub-n-grams should be considered matched as well; we call this the covered n-gram matching rule.
The Algorithm
However, we cannot simply perform covered n-gram matching as a post processing step.
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Wintrode, Jonathan and Khudanpur, Sanjeev
Abstract
We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models.
Conclusions
Using word repetitions, we effectively use a broad document context outside of the typical 2-5 N-gram window.
Introduction
ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream.
Introduction
N-gram models cannot, however, capture complex linguistic or topical phenomena that occur outside the typical 3-5 word scope of the model.
Introduction
Confidence scores from an ASR system (which incorporate N-gram probabilities) are optimized in order to produce the most likely sequence of words rather than the accuracy of individual word detections.
Motivation
We seek a workable definition of broad document context beyond N-gram models that will improve term detection performance on an arbitrary set of queries.
Motivation
A number of efforts have been made to augment traditional N-gram models with latent topic information (Khudanpur and Wu, 1999; Florian and Yarowsky, 1999; Liu and Liu, 2008; Hsu and Glass, 2006; Naptali et al., 2012) including some of the early work on Probabilistic Latent Semantic Analysis by Hofmann (2001).
Term and Document Frequency Statistics
In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence.
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Bergsma, Shane and Lin, Dekang and Goebel, Randy
Evaluation
We smooth by adding 40 to all counts, equal to the minimum count in the n-gram data.
Introduction
For example, in our n-gram collection (Section 3.4), “make it in advance” and “make them in advance” occur roughly the same number of times (442 vs. 449), indicating a referential pattern.
Methodology
We gather pattern fillers from a large collection of n-gram frequencies.
Methodology
We do the same processing to our n-gram corpus.
Methodology
Also, other pronouns in the pattern are allowed to match a corresponding pronoun in an n-gram , regardless of differences in inflection and class.
n-gram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Ma, Ji and Zhang, Yue and Zhu, Jingbo
Experiments
For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l,r) in {(1, 1), (2,1), (1,2), (2,2)}; the hidden layer size in {200, 300, 400}; the learning rate in {0.1, 0.01, 0.001}.
Experiments
5.3.2 Word and N-gram Representation
Experiments
By contrast, using n-gram representations improves the performance on both oov and non-oov.
Learning from Web Text
The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders.
Learning from Web Text
Let V^(j) represent the j-th visible variable of the WRRBM, which is a vector whose length is the vocabulary size. Then V^(j) = w_k means that the j-th word in the n-gram is w_k.
Neural Network for POS Disambiguation
The input for this module is the word n-gram (w_{t-l}, ..., w_{t+r}).
Neural Network for POS Disambiguation
representations of the input n-gram .
Related Work
While those approaches mainly explore token-level representations (word or character embeddings), using WRRBM is able to utilize both word and n-gram representations.
n-gram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Tang, Hao and Keshet, Joseph and Livescu, Karen
Feature functions
We use p_{i:i+n-1} to denote the n-gram substring p_i ... p_{i+n-1}.
Feature functions
The two substrings a and b are said to be equal if they have the same length and a_i = b_i for 1 ≤ i ≤ n. For a given sub-word unit n-gram u ∈ P^n, we use the shorthand u ∈ p to mean that we can find u in p, i.e., there exists an index i such that p_{i:i+n-1} = u. We use |p| to denote the length of the sequence p.
Feature functions
Similarly to (Zweig et al., 2010), we adapt TF and IDF by treating a sequence of sub-word units as a “document” and n-gram sub-sequences as “words.” In this analogy, we use sub-sequences in surface pronunciations to “search” for baseforms in the dictionary.
n-gram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Kauchak, David
Introduction
Table 1 shows the n-gram overlap proportions in a sentence aligned data set of 137K sentence pairs from aligning Simple English Wikipedia and English Wikipedia articles (Coster and Kauchak, 2011a). The data highlights two conflicting views: does the benefit of additional data outweigh the problem of the source of the data?
Introduction
n-gram size:       1     2     3     4     5
simple in normal:  0.96  0.80  0.68  0.61  0.55
normal in simple:  0.87  0.68  0.58  0.51  0.46
Why Does Unsimplified Data Help?
trained using a smoothed version of the maximum likelihood estimate for an n-gram .
Why Does Unsimplified Data Help?
count(bc), where count(·) is the number of times the n-gram occurs in the training corpus.
Why Does Unsimplified Data Help?
For interpolated and backoff n-gram models, these counts are smoothed based on the probabilities of lower-order n-gram models, which are in turn calculated based on counts from the corpus.
n-gram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Experiments
The NBoW performs similarly to the non-neural n-gram based classifiers.
Experiments
Besides the RecNN that uses an external parser to produce structural features for the model, the other models use n-gram based or neural features that do not require external resources or additional annotations.
Experiments
We see a significant increase in the performance of the DCNN with respect to the non-neural n-gram based classifiers; in the presence of large amounts of training data these classifiers constitute particularly strong baselines.
Introduction
Convolving the same filter with the n-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence.
Properties of the Sentence Model
4.1 Word and n-Gram Order
Properties of the Sentence Model
For most applications and in order to learn fine-grained feature detectors, it is beneficial for a model to be able to discriminate whether a specific n-gram occurs in the input.
Properties of the Sentence Model
2.3, the Max-TDNN is sensitive to word order, but max pooling only picks out a single n-gram feature in each row of the sentence matrix.
n-gram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Chen, Wenliang and Zhang, Min and Li, Haizhou
Dependency language model
The standard N-gram based language model predicts the next word based on the N — 1 immediate previous words.
Dependency language model
However, the traditional N-gram language model can not capture long-distance word relations.
Dependency language model
The N-gram DLM predicts the next child of a head based on the N — 1 immediate previous children and the head itself.
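A hedged sketch of the prediction rule just described: with H a head word and ch_1, ..., ch_{k-1} the children already generated on one side, an N-gram dependency language model scores the next child roughly as below (the notation is illustrative; Shen et al. (2008) give the exact formulation).

```latex
P_{\mathrm{DLM}}\big(ch_k \mid H,\, ch_1, \ldots, ch_{k-1}\big) \;\approx\; P\big(ch_k \mid ch_{k-N+1}, \ldots, ch_{k-1},\, H\big)
```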
Experiments
Then, we studied the effect of adding different N-gram DLMs to MSTl.
Introduction
The N-gram DLM has the ability to predict the next child based on the N-1 immediate previous children and their head (Shen et al., 2008).
Introduction
The DLM-based features can capture the N-gram information of the parent-children structures for the parsing model.
n-gram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Machine Translation as a Decipherment Task
For P (e), we use a word n-gram LM trained on monolingual English data.
Machine Translation as a Decipherment Task
Whole-segment Language Models: When using word n-gram models of English for decipherment, we find that some of the foreign sentences are decoded into sequences (such as “THANK YOU TALKING ABOUT ?”) that are not good English.
Machine Translation as a Decipherment Task
This stems from the fact that n-gram LMs have no global information about what constitutes a valid English segment.
Word Substitution Decipherment
We model P(e) using a statistical word n-gram English language model (LM).
Word Substitution Decipherment
Our method holds several other advantages over the EM approach: (1) inference using smart sampling strategies permits efficient training, allowing us to scale to large data/vocabulary sizes, (2) incremental scoring of derivations during sampling allows efficient inference even when we use higher-order n-gram LMs, (3) there are no memory bottlenecks since the full channel model and derivation lattice are never instantiated during training, and (4) prior specification allows us to learn skewed distributions that are useful here—word substitution ciphers exhibit 1-to-1 correspondence between plaintext and cipher types.
Word Substitution Decipherment
build an English word n-gram LM, which is used in the decipherment process.
n-gram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Baldwin, Tyler and Li, Yunyao and Alexe, Bogdan and Stanoi, Ioana R.
Experimental Evaluation
N-gram suggests no non-referential instances
Experimental Evaluation
Effectiveness To understand the contribution of the n-gram (NG), ontology (ON), and clustering (CL) based modules, we ran each separately, as well as every possible combination.
Experimental Evaluation
Of the three individual modules, the n-gram and clustering methods achieve F-measure of around 0.9, while the ontology-based module performs only modestly above baseline.
Term Ambiguity Detection (TAD)
This module examines n-gram data from a large text collection.
Term Ambiguity Detection (TAD)
The rationale behind the n-gram module is based on the understanding that terms appearing in non-named entity contexts are likely to be non-referential, and terms that can be non-referential are ambiguous.
Term Ambiguity Detection (TAD)
Since we wish for the ambiguity detection determination to be fast, we develop our method to make this judgment solely on the n-gram probability, without the need to examine each individual usage context.
n-gram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Chiang, David
Smoothing on count distributions
On integral counts, this is simple: we generate, for each n-gram type vu'w, an (n-1)-gram token u'w, for a total of n_{1+}(•u'w) tokens.
Smoothing on count distributions
Analogously, on count distributions, for each n-gram type vu'w, we generate an (n-1)-gram token u'w with probability p(c(vu'w) > 0).
Smoothing on count distributions
Using the dynamic program in Section 3.2, computing the distributions for each r is linear in the number of n-gram types, and we only need to compute the distributions up to r = 2 (or r = 4 for modified KN), and store them for r = 0 (or up to r = 2 for modified KN).
Smoothing on integral counts
Let uw stand for an n-gram, where u stands for the (n-1) context words and w, the predicted word.
Smoothing on integral counts
where n_{1+}(u•) = |{w | c(uw) > 0}| is the number of word types observed after context u, and q_u(w) specifies how to distribute the subtracted discounts among unseen n-gram types.
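For context, the interpolated absolute-discounting form that these definitions belong to can be written as follows; this is the standard Kneser-Ney-style statement with discount D, and the paper's own equation may differ in detail.

```latex
p(w \mid u) \;=\; \frac{\max\big(c(uw) - D,\, 0\big)}{c(u\,\cdot)}
\;+\; \frac{D\, n_{1+}(u\,\cdot)}{c(u\,\cdot)}\; q_u(w),
\qquad c(u\,\cdot) = \sum_{w} c(uw)
```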
Smoothing on integral counts
The probability of an n-gram token uw using the other tokens as training data is
n-gram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Thadani, Kapil
Abstract
Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming.
Experiments
• ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.
Experiments
Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n-gram factorization, which is also responsible for a significant increase in wall-clock time.
Experiments
We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures.
Introduction
Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering. A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).
Introduction
In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions.
Multi-Structure Sentence Compression
Let α(y) ∈ {0, 1}^n denote the incidence vector of tokens contained in the n-gram sequence y and β(z) ∈ {0, 1}^n denote the incidence vector of words contained in the dependency tree z.
n-gram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Class-Based Language Modeling
Generalizing this leads to arbitrary order class-based n-gram models of the form:
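The quoted sentence stops just before the equation. The arbitrary-order class-based factorization it generalizes to is conventionally written as below (following Brown et al., 1992), where c_i is the class of word w_i; the notation is a reconstruction, not copied from the paper.

```latex
P\big(w_i \mid w_1^{i-1}\big) \;\approx\; P\big(w_i \mid c_i\big)\, P\big(c_i \mid c_{i-n+1}^{i-1}\big)
```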
Introduction
In the case of n-gram language models this is done by factoring the probability:
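This quote is likewise cut off before the formula; the standard n-gram factorization it refers to is, in the usual notation (again a reconstruction rather than the paper's exact equation):

```latex
P\big(w_1^N\big) \;=\; \prod_{i=1}^{N} P\big(w_i \mid w_1^{i-1}\big)
\;\approx\; \prod_{i=1}^{N} P\big(w_i \mid w_{i-n+1}^{i-1}\big)
```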
Introduction
do not differ in the last n-1 words, one problem n-gram language models suffer from is that the training data is too sparse to reliably estimate all conditional probabilities P(w_i | w_{i-n+1}, ..., w_{i-1}).
Introduction
However, in the area of statistical machine translation, especially in the context of large training corpora, fewer experiments with class-based n-gram models have been performed with mixed success (Raab, 2006).
n-gram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Lee, John and Seneff, Stephanie
Abstract
To improve recall, irregularities in parse trees caused by verb form errors are taken into account; to improve precision, n-gram counts are utilized to filter proposed corrections.
Experiments
For those categories with a high rate of false positives (all except BASEmd, BASEdo and FINITE), we utilized n-grams as filters, allowing a correction only when its n-gram count in the WEB 1T 5-GRAM
Experiments
Some kind of confidence measure on the n-gram counts might be appropriate for reducing such false alarms.
Introduction
To improve recall, irregularities in parse trees caused by verb form errors are considered; to improve precision, n-gram counts are utilized to filter proposed corrections.
Research Issues
We propose using n-gram counts as a filter to counter this kind of overgeneralization.
Research Issues
A second goal is to show that n-gram counts can effectively serve as a filter, in order to increase precision.
n-gram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Li, Haibo and Zheng, Jing and Ji, Heng and Li, Qi and Wang, Wen
Baseline MT
The LM used for decoding is a log-linear combination of four word n-gram LMs which are built on different English
Name-aware MT Evaluation
where w_n is a set of positive weights summing to one and usually uniformly set as w_n = 1/N, c is the length of the system translation and r is the length of the reference translation, and p_n is the modified n-gram precision, defined as follows.
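The extraction cuts the definition off; for reference, the standard BLEU formulation from Papineni et al. (2002), which this sentence paraphrases, is sketched below, where Count_clip clips each candidate n-gram count at its maximum count in any single reference.

```latex
p_n \;=\; \frac{\sum_{C \in \mathrm{cands}} \sum_{g \in \text{$n$-grams}(C)} \mathrm{Count}_{\mathrm{clip}}(g)}
               {\sum_{C \in \mathrm{cands}} \sum_{g \in \text{$n$-grams}(C)} \mathrm{Count}(g)},
\qquad
\mathrm{BLEU} \;=\; \min\!\big(1,\, e^{\,1 - r/c}\big)\,\exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big)
```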
Name-aware MT Evaluation
As in BLEU metric, we first count the maximum number of times an n-gram occurs in any single reference translation.
Name-aware MT Evaluation
The weight of an n-gram in reference translation is the sum of weights of all tokens it contains.
n-gram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chambers, Nathanael and Jurafsky, Daniel
How Frequent is Unseen Data?
The third line across the bottom of the figure is the number of unseen pairs using Google n-gram data as proxy argument counts.
How Frequent is Unseen Data?
Creating argument counts from n-gram counts is described in detail below in section 5.2.
Models
Using the Google n-gram corpus, we recorded all verb-noun co-occurrences, defined by appearing in any order in the same n-gram , up to and including 5-grams.
Models
For instance, the test pair (throwsubject, ball) is considered seen if there exists an n-gram such that throw and ball are both included.
Models
C(v_d, n) as the number of times v and n (ignoring d) appear in the same n-gram.
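A minimal Python sketch of the proxy-count idea described in these quotes: verb-noun pairs are counted as co-occurring whenever they appear, in any order, inside the same n-gram entry. The entry format, the lookup sets, and the function names are assumptions for illustration; the paper's preprocessing is more involved.

```python
from collections import Counter

def cooccurrence_counts(ngram_entries, verbs, nouns):
    """Count verb-noun co-occurrences within the same n-gram (any order),
    weighting each pair by the n-gram's corpus frequency."""
    counts = Counter()
    for tokens, freq in ngram_entries:
        vs = [t for t in tokens if t in verbs]
        ns = [t for t in tokens if t in nouns]
        for v in vs:
            for n in ns:
                counts[(v, n)] += freq
    return counts

# Toy entries standing in for Google 5-gram records of the form (tokens, count).
entries = [(("he", "threw", "the", "ball", "hard"), 120),
           (("the", "ball", "was", "thrown", "away"), 45)]
print(cooccurrence_counts(entries, verbs={"threw", "thrown"}, nouns={"ball"}))
# Counter({('threw', 'ball'): 120, ('thrown', 'ball'): 45})
```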
Results
The Google n-gram backoff model is almost as good as backing off to the Erk smoothing model.
n-gram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Espinosa, Dominic and White, Michael and Mehay, Dennis
Background
makes use of n-gram language models over words represented as vectors of factors, including surface form, part of speech, supertag and semantic class.
Background
In the anytime mode, a best-first search is performed with a configurable time limit: the scores assigned by the n-gram model determine the order of the edges on the agenda, and thus have an impact on realization speed.
Background
the one that covers the most elementary predications in the input logical form, with ties broken according to the n-gram score.
The Approach
Table 1: Percentage of complete realizations using an oracle n-gram model versus the best performing factored language model.
n-gram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Chen, Yanping and Zheng, Qinghua and Zhang, Wei
Feature Construction
Admittedly, the Omni-word feature can be seen as a subset of the n-Gram feature.
Feature Construction
It is not the same as the n-Gram feature.
Feature Construction
N-Gram features are more fragmented.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Escalante, Hugo Jair and Solorio, Thamar and Montes-y-Gomez, Manuel
Experiments and Results
For our character n-gram experiments, we obtained LOWBOW representations for character 3-grams (only n-grams of size n = 3 were used) considering the 2, 500 most common n-grams.
Experiments and Results
Also, n-gram information is more dense in documents than word-level information.
Related Work
(2003) propose the use of language models at the n-gram character-level for AA, whereas Keselj et al.
Related Work
on characters at the n-gram level (Plakias and Stamatatos, 2008a).
Related Work
Acceptable performance in AA has been reported with character n-gram representations.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
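The character n-gram representation above (3-grams, restricted to the 2,500 most common ones) can be sketched as below; the frequency cutoff follows the excerpt, but the plain count vector is a simplification of the LOWBOW representation the paper actually uses.

```python
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocab(corpus, n=3, top_k=2500):
    """Keep only the top_k most common character n-grams across the corpus."""
    freq = Counter(g for doc in corpus for g in char_ngrams(doc, n))
    return [g for g, _ in freq.most_common(top_k)]

def vectorize(doc, vocab, n=3):
    counts = Counter(char_ngrams(doc, n))
    return [counts[g] for g in vocab]

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = build_vocab(corpus)
print(vectorize("the cat on the log", vocab)[:10])
```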
Celikyilmaz, Asli and Hakkani-Tur, Dilek
Data and Approach Overview
(IIa) Entity List Prior (IIIa) Web N-Gram Context Prior
Data and Approach Overview
n-gram → web query logs; domain-specific entity and act parameter priors (text fragment from the approach-overview figure)
Experiments
* Base-MCM: Our first version injects an informative prior for domain, dialog act and slot topic distributions, using information extracted only from labeled training utterances as prior constraints (the corpus n-gram base measure) during topic assignments.
MultiLayer Context Model - MCM
* Web n-Gram Context Base Measure (2%): As explained in §3, we use the web n-grams as additional information for calculating the base measures of the Dirichlet topic distributions.
MultiLayer Context Model - MCM
* Corpus n-Gram Base Measure (2%): Similar to other measures, MCM also encodes n-gram constraints as word-frequency features extracted from labeled utterances.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Carpuat, Marine and Daume III, Hal and Henry, Katharine and Irvine, Ann and Jagarlamudi, Jagadeesh and Rudinger, Rachel
New Sense Indicators
N-gram Probability Features The goal of the Type:NgramProb feature is to capture the fact that “unusual contexts” might imply new senses.
New Sense Indicators
To capture this, we can look at the log probability of the word under consideration given its N-gram context, both according to an old-domain language model (call this θ^old) and a new-domain language
New Sense Indicators
From these four values, we compute corpus-level (and therefore type-based) statistics of the new-domain n-gram log probability (θ^new_ngram), the difference between the n-gram probabilities in each domain (θ^new_ngram − θ^old_ngram), the difference between the n-gram and unigram probabilities in the new domain (θ^new_ngram − θ^new_unigram), and finally the combined difference: θ^new_ngram − θ^new_unigram + θ^old_unigram − θ^old_ngram.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Future Work
This powerful method, which was proposed in (Kam and Kopec, 1996; Popat et al., 2001) in the context of a finite-state model (but not of TSP), can be easily extended to N-gram situations, and typically converges in a small number of iterations.
Introduction
Typical nonlocal features include one or more n-gram language models as well as a distortion feature, measuring by how much the order of biphrases in the candidate translation deviates from their order in the source sentence.
Phrase-based Decoding as TSP
4.1 From Bigram to N-gram LM
Phrase-based Decoding as TSP
If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gram
Phrase-based Decoding as TSP
The problem becomes even worse if we extend the compiling-out method to n-gram language models with n > 3.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua
Distributed representations
For each training update, we read an n-gram x = (w_1, …, w_n).
Distributed representations
We also create a corrupted or noise n-gram x̃.
Distributed representations
0 We corrupt the last word of each n-gram .
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
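The corruption step described above, replacing the last word of an observed n-gram to form a noise example, is the core of the C&W-style ranking objective these excerpts refer to. A minimal sketch with a toy scoring function; the score table and margin are stand-ins for the neural scorer and its hyperparameters.

```python
import random

def corrupt_last_word(ngram, vocabulary, rng=random):
    """Return a noise n-gram that differs from the input only in its last word."""
    noise_word = rng.choice([w for w in vocabulary if w != ngram[-1]])
    return ngram[:-1] + (noise_word,)

def ranking_loss(score, ngram, noise_ngram, margin=1.0):
    """Hinge loss: the observed n-gram should outscore the corrupted one by a margin."""
    return max(0.0, margin - score(ngram) + score(noise_ngram))

# Toy scorer standing in for the neural network's output.
toy_scores = {("the", "cat", "sat"): 2.0, ("the", "cat", "zebra"): 0.1}
score = lambda g: toy_scores.get(g, 0.0)

x = ("the", "cat", "sat")
x_noise = corrupt_last_word(x, ["sat", "zebra"], rng=random.Random(0))
print(x_noise, ranking_loss(score, x, x_noise))  # ('the', 'cat', 'zebra') 0.0
```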
Liu, Yang
Abstract
As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models.
Introduction
In addition, it is straightforward to integrate n-gram language models into phrase-based decoders in which translation always grows left-to-right.
Introduction
Unfortunately, as syntax-based decoders often generate target-language words in a bottom-up way using the CKY algorithm, integrating n-gram language models becomes more expensive because they have to maintain target boundary words at both ends of a partial translation (Chiang, 2007; Huang and Chiang, 2007).
Introduction
3. efficient integration of n-gram language model: as translation grows left-to-right in our algorithm, integrating n-gram language models is straightforward.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Automated Approaches to Deceptive Opinion Spam Detection
In contrast to the other strategies just discussed, our text categorization approach to deception detection allows us to model both content and context with n-gram features.
Automated Approaches to Deceptive Opinion Spam Detection
Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection
We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Introduction
Notably, a combined classifier with both n-gram and psychological deception features achieves nearly 90% cross-validated accuracy on this task.
Results and Discussion
Surprisingly, models trained only on UNIGRAMS—the simplest n-gram feature set—outperform all non-text-categorization approaches, and models trained on BIGRAMS+ perform even better (one-tailed sign test p = 0.07).
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
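The UNIGRAMS / BIGRAMS+ / TRIGRAMS+ convention above means each feature set subsumes the lower-order one. A quick sketch of building such cumulative, lowercased, unstemmed n-gram features; this covers only the feature extraction, not the classifiers or the Kneser-Ney-smoothed language models mentioned in the excerpts.

```python
def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n):
    """Cumulative feature set: max_n=1 -> UNIGRAMS, 2 -> BIGRAMS+, 3 -> TRIGRAMS+."""
    tokens = text.lower().split()          # lowercased, unstemmed
    feats = {}
    for n in range(1, max_n + 1):
        for gram in word_ngrams(tokens, n):
            feats[gram] = feats.get(gram, 0) + 1
    return feats

print(sorted(ngram_features("My stay was wonderful", 2)))
# unigrams plus bigrams: 'my', 'my stay', 'stay', 'stay was', 'was', ...
```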
Mukherjee, Arjun and Liu, Bing
Conclusion
A novel technique was also proposed to rank n-gram phrases where relevance based ranking was used in conjunction with a semi-supervised generative model.
Empirical Evaluation
The reduced dataset consists of 1095586 tokens (after n-gram preprocessing in §4), 40102 posts with an average of 27 posts or interactions per pair.
Model
Like most generative models for text, a post (document) is viewed as a bag of n-grams and each n-gram (word/phrase) takes one value from a predefined vocabulary.
Phrase Ranking based on Relevance
While this is reasonable, a significant n-gram with high likelihood score may not necessarily be relevant to the problem domain.
Phrase Ranking based on Relevance
There is nothing wrong with this per se, because the statistical tests only judge the significance of an n-gram; a significant n-gram, however, may not necessarily be relevant in a given problem domain.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chen, David and Dolan, William
Abstract
The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates.
Paraphrase Evaluation Metrics
Thus, no n-gram overlaps are required to determine the semantic adequacy of the paraphrase candidates.
Paraphrase Evaluation Metrics
In essence, it is the inverse of BLEU since we want to minimize the number of n-gram overlaps between the two sentences.
Paraphrase Evaluation Metrics
where N is the maximum n-gram considered and n-
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
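The lexical-dissimilarity idea above is essentially one minus the average n-gram overlap between a paraphrase candidate and its source sentence (the inverse-of-BLEU intuition in the excerpts). A rough PINC-style rendering, with max_n as the maximum n-gram order; treat it as an illustration rather than the paper's exact metric.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lexical_dissimilarity(source, candidate, max_n=4):
    """1 minus the mean fraction of candidate n-grams also found in the source;
    higher means the paraphrase reuses fewer of the original word sequences."""
    src, cand = source.split(), candidate.split()
    overlaps = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_set(cand, n)
        if not cand_ngrams:
            continue
        overlaps.append(len(cand_ngrams & ngram_set(src, n)) / len(cand_ngrams))
    return (1.0 - sum(overlaps) / len(overlaps)) if overlaps else 0.0

print(lexical_dissimilarity("a man is slicing a potato",
                            "someone is cutting up a potato"))
```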
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
A Syntax Free Sequence-oriented Sentence Compression Method
Many studies on sentence compression employ the n-gram language model to evaluate the linguistic likelihood of a compressed sentence.
A Syntax Free Sequence-oriented Sentence Compression Method
The n-gram distribution of short sentences may differ from that of long sentences.
A Syntax Free Sequence-oriented Sentence Compression Method
Therefore, the n-gram probability sometimes disagrees with our intuition in terms of sentence compression.
Experimental Evaluation
We developed the n-gram language model from a 9 year set of Mainichi Newspaper articles.
Results and Discussion
This result shows that the n-gram language model is improper for sentence compression because the n-gram probability is computed by using a corpus that includes both short and long sentences.
n-gram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhang, Dongdong and Li, Mu and Duan, Nan and Li, Chi-Ho and Zhou, Ming
Experiments
In the tables, Lm denotes the n-gram language model feature, Tmh denotes the feature of collocation between target head words and the candidate measure word, Smh denotes the feature of collocation between source head words and the candidate measure word, HS denotes the feature of source head word selection, Punc denotes the feature of target punctuation position, Tlex denotes surrounding word features in translation, Slex denotes surrounding word features in the source sentence, and Pos denotes the Part-Of-Speech feature.
Introduction
In this case, an n-gram language model with n<15 cannot capture the MW-HW collocation.
Our Method
For target features, n-gram language model score is defined as the sum of log n-gram probabilities within the target window after the measure
Our Method
Target features: n-gram language model score, MW-HW collocation, surrounding words, punctuation position. Source features: MW-HW collocation, surrounding words, source head word, POS tags.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hasegawa, Takayuki and Kaji, Nobuhiro and Yoshinaga, Naoki and Toyoda, Masashi
Experiments
RESPONSE The n-gram and emotion features induced from the response.
Experiments
The n-gram and emotion features induced from the response and the addressee’s utterance.
Predicting Addressee’s Emotion
We extract all the n-grams (n ≤ 3) in the response to induce (binary) n-gram features.
Predicting Addressee’s Emotion
The extracted n-grams activate another set of binary n-gram features.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith
Decipherment Model for Machine Translation
For P(e), we use a word n-gram language model (LM) trained on monolingual target text.
Decipherment Model for Machine Translation
Generate a target (e.g., English) string e = e_1…e_l with probability P(e) according to an n-gram language model.
Feature-based representation for Source and Target
For instance, context features for word w may include other words (or phrases) that appear in the immediate context ( n-gram window) surrounding w in the monolingual corpus.
Feature-based representation for Source and Target
The feature construction process is described in more detail below: Target Language: We represent each word (or phrase) e_i with the following contextual features along with their counts: (a) f−context: every (word n-gram, position) pair immediately preceding e_i in the monolingual corpus (n=1, position=−1), (b) similar features f+context to model the context following e_i, and (c) we also throw in generic context features f±context without position information—every word that co-occurs with e_i in the same sen-
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Guo, Weiwei and Diab, Mona
Experiments and Results
The performance of WTMF on CDR is compared with (a) an Information Retrieval model (IR) that is based on surface word matching, (b) an n-gram model ( N-gram ) that captures phrase overlaps by returning the number of overlapping ngrams as the similarity score of two sentences, (c) LSA that uses svds() function in Matlab, and (d) LDA that uses Gibbs Sampling for inference (Griffiths and Steyvers, 2004).
Experiments and Results
The similarity of two sentences is computed by cosine similarity (except N-gram ).
Experiments and Results
We mainly compare the performance of IR, N-gram , LSA, LDA, and WTMF models.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Özbal, Gözde and Pighin, Daniele and Strapparava, Carlo
Architecture of BRAINSUP
N-gram likelihood.
Architecture of BRAINSUP
This is simply the likelihood of a sentence estimated by an n-gram language model, to enforce the generation of well-formed word sequences.
Architecture of BRAINSUP
When a solution is not complete, in the computation we include only the sequences of contiguous words (i.e., not interrupted by empty slots) having length greater than or equal to the order of the n-gram model.
Evaluation
The four combinations of features are: base: Target-word scorer + N-gram likelihood + Dependency likelihood + Variety scorer + Unusual-words scorer + Semantic cohesion; base+D: all the scorers in base + Domain relatedness; base+D+C: all the scorers in base+D + Chromatic connotation; base+D+E: all the scorers in base+D + Emotional connotation; base+D+P: all the scorers in base+D + Phonetic features.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Pado, Sebastian and Galley, Michel and Jurafsky, Dan and Manning, Christopher D.
Experimental Evaluation
BLEUR includes the following 18 sentence-level scores: BLEU-n and n-gram precision scores (1 ≤ n ≤ 4); BLEU brevity penalty (BP); BLEU score divided by BP.
Experimental Evaluation
To counteract BLEU’s brittleness at the sentence level, we also smooth BLEU-n and n-gram precision as in Lin and Och (2004).
Experimental Evaluation
NIST-n scores (1 ≤ n ≤ 10) and information-weighted n-gram precision scores (1 ≤ n ≤ 4); NIST brevity penalty (BP); and NIST score divided by BP.
Introduction
BLEU and NIST measure MT quality by using the strong correlation between human judgments and the degree of n-gram overlap between a system hypothesis translation and one or more reference translations.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Heilman, Michael and Cahill, Aoife and Madnani, Nitin and Lopez, Melissa and Mulholland, Matthew and Tetreault, Joel
Abstract
In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
Discussion and Conclusions
While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments
n-gram frequencies from Gigaword and whether the link parser can fully parse the sentence.
System Description
3.2.2 n-gram Count and Language Model Features
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Lo, Chi-kiu and Wu, Dekai
Abstract
As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail to properly evaluate adequacy, become more apparent.
Abstract
N-gram based metrics assume that “good” translations tend to share the same lexical choices as the reference translations.
Abstract
As MT systems improve, the shortcomings of the n-gram based evaluation metrics are becoming more apparent.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Bendersky, Michael and Croft, W. Bruce and Smith, David A.
Experiments
SEG-I method requires an access to a large web n-gram corpus (Brants and Franz, 2006).
Experiments
where S_Q is the set of all possible query segmentations, S is a possible segmentation, s is a segment in S, and count(s) is the frequency of s in the web n-gram corpus.
Experiments
(2009), and include, among others, n-gram frequencies in a sample of a query log, web corpus and Wikipedia titles.
Independent Query Annotations
(2010), an estimate of p(C_i | r) is a smoothed estimator that combines the information from the retrieved sentence r with the information about unigrams (for capitalization and POS tagging) and bigrams (for segmentation) from a large n-gram corpus (Brants and Franz, 2006).
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
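The segmentation score referenced above combines per-segment frequencies from a web n-gram corpus over all possible splits of the query. A brute-force sketch that scores each segmentation by the product of its segment counts; the scoring function is a simplification of the smoothed estimator the paper uses, and the counts are illustrative.

```python
from itertools import combinations

def segmentations(tokens):
    """Yield every way to split the token list into contiguous segments."""
    n = len(tokens)
    for cut_count in range(n):
        for cuts in combinations(range(1, n), cut_count):
            bounds = (0,) + cuts + (n,)
            yield [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]

def best_segmentation(query, count):
    """Pick the segmentation whose segments have the highest product of counts."""
    def score(seg):
        product = 1
        for s in seg:
            product *= max(count.get(s, 0), 1)   # unseen segments get count 1
        return product
    return max(segmentations(query.split()), key=score)

web_counts = {"new york": 500000, "times square": 300000, "york times": 200000}
print(best_segmentation("new york times square", web_counts))
# -> ['new york', 'times square']
```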
Labutov, Igor and Lipson, Hod
Model
trailing word in the n-gram ).
Model
Intuitively, an update to the parameters of w_i occurs after the learner observes word w_i in a context (this may be an n-gram, an entire sentence or paragraph containing w_i, but we will restrict our attention to fixed-length n-grams).
Model
For this task, we focus only on the following context features for predicting the “predictability” of words: n-gram probability, vector-space similarity score, coreferring mentions.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Mauser, Arne and Ney, Hermann
Abstract
On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model.
Experimental Evaluation
being approximately 15 to 20 times faster than their n-gram based approach.
Experimental Evaluation
To summarize: Our method is significantly faster than n-gram LM based approaches and obtains better results than any previously published method.
Translation Model
Stochastically generate the target sentence according to an n-gram language model.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Auli, Michael and Gao, Jianfeng
Decoder Integration
One exception is the n-gram language model which requires the preceding n — 1 words as well.
Decoder Integration
To solve this problem, we follow previous work on lattice rescoring with recurrent networks that maintained the usual n-gram context but kept a beam of hidden layer configurations at each state (Auli et al., 2013).
Decoder Integration
This approximation has been effective for lattice rescoring, since the translations represented by each state are in fact very similar: They share both the same source words as well as the same n-gram context which is likely to result in similar recurrent histories that can be safely pruned.
Introduction
Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Chen, Boxing and Kuhn, Roland and Larkin, Samuel
BLEU and PORT
First, define n-gram precision p(n) and recall r(n):
BLEU and PORT
where P_g(N) is the geometric average of n-gram precisions
BLEU and PORT
The average precision and average recall used in PORT (unlike those used in BLEU) are the arithmetic average of n-gram precisions, P_a(N), and recalls, R_a(N):
Experiments
As usual, French-English is the outlier: the two outputs here are typically so similar that BLEU and Qmean tuning yield very similar n-gram statistics.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
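The geometric and arithmetic averages referred to above can be written out explicitly. The following is the standard form of such averages over the n-gram precisions p(n) and recalls r(n), given here for orientation rather than as the paper's exact equations:

```latex
P_g(N) = \Bigl(\prod_{n=1}^{N} p(n)\Bigr)^{1/N}, \qquad
P_a(N) = \frac{1}{N}\sum_{n=1}^{N} p(n), \qquad
R_a(N) = \frac{1}{N}\sum_{n=1}^{N} r(n)
```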
Huang, Fei and Yates, Alexander
Related Work
Smoothing in NLP usually refers to the problem of smoothing n-gram models.
Related Work
Sophisticated smoothing techniques like modified Kneser-Ney and Katz smoothing (Chen and Goodman, 1996) smooth together the predictions of unigram, bi-gram, trigram, and potentially higher n-gram sequences to obtain accurate probability estimates in the face of data sparsity.
Related Work
While n-gram models have traditionally dominated in language modeling, two recent efforts de-
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Liu, Shujie and Li, Chi-Ho and Li, Mu and Zhou, Ming
Features and Training
Here simple n-gram similarity is used for the sake of efficiency.
Features and Training
Like GC , there are four features with respect to the value of n in n-gram similarity measure.
Graph Construction
Solid lines are edges connecting nodes with sufficient source side n-gram similarity, such as the one between "E A M N" and "E A B C".
Introduction
Collaborative decoding (Li et al., 2009) scores the translation of a source span by its n-gram similarity to the translations by other systems.
n-gram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Methods
In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n − 1) tokens.
Methods
Programmatic composition of a lattice with an n-gram LM acceptor is a well understood problem.
Methods
With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bansal, Mohit and Burkett, David and de Melo, Gerard and Klein, Dan
Experiments
Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text.
Features
3.2 Semantic Features 3.2.1 Web n-gram Features Patterns and counts: Hypernymy for a term pair
Features
Hence, we fire Web n-gram pattern features and Wikipedia presence, distance, and pattern features, similar to those described above, on each potential sibling term pair.7 The main difference here from the edge factors is that the sibling factors are symmetric (in the sense that S_jk is redundant to S_kj) and hence the patterns are undirected.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Plank, Barbara and Hovy, Dirk and Sogaard, Anders
Hard cases and annotation errors
the longest sequence of words (n-gram) in a corpus that has been observed with a token being tagged differently in another occurrence of the same n-gram in the same corpus.
Hard cases and annotation errors
For each variation n-gram that we found in WSJ-OO, i.e, a word in various contexts and the possible tags associated with it, we present annotators with the cross product of contexts and tags.
Hard cases and annotation errors
The figure shows, for instance, that the variation n-gram regarding ADP-ADV is the second most frequent one (dark gray), and approximately 70% of ADP-ADV disagreements are linguistically hard cases (light gray).
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
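The "variation n-gram" notion above (the same word sequence occurring more than once in a corpus with its focus token tagged differently) can be found with a simple scan over tagged sentences. The sketch below uses a fixed window with the focus token in the middle; the window size and data format are illustrative choices, not the paper's exact procedure.

```python
from collections import defaultdict

def variation_ngrams(tagged_sentences, n=3):
    """Return n-gram contexts whose middle token receives more than one tag
    across different occurrences of the identical n-gram."""
    tags_seen = defaultdict(set)
    mid = n // 2
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for i in range(len(sent) - n + 1):
            tags_seen[tuple(words[i:i + n])].add(tags[i + mid])
    return {ctx: ts for ctx, ts in tags_seen.items() if len(ts) > 1}

sents = [
    [("I", "PRON"), ("run", "VERB"), ("home", "ADV")],
    [("I", "PRON"), ("run", "NOUN"), ("home", "ADV")],  # same trigram, different tag
]
print(variation_ngrams(sents))  # {('I', 'run', 'home'): {'VERB', 'NOUN'}}
```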
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
A Skeleton-based Approach to MT 2.1 Skeleton Identification
For language modeling, lm is the standard n-gram language model adopted in the baseline system.
A Skeleton-based Approach to MT 2.1 Skeleton Identification
In such a way of string representation, the skeletal language model can be implemented as a standard n-gram language model, that is, a string probability is calculated by a product of a sequence of n-gram probabilities (involving normal words and X).
A Skeleton-based Approach to MT 2.1 Skeleton Identification
The skeletal language model is then trained on these generalized strings in a standard way of n-gram language modeling.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut
Models
The N-gram approximation of the joint probability can be defined in terms of multigrams qi as:
Models
N-gram models of order > 1 did not work well because these models tended to learn noise (information from non-transliteration pairs) in the training data.
Previous Research
(2010) submitted another system based on a standard n-gram kernel which ranked first for the English/Hindi and English/Tamil tasks.6 For the English/Arabic task, the transliteration mining system of Noeman and Madkour (2010) was best.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
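The excerpt above is cut off before the equation it introduces. As a hedged reconstruction of what an N-gram approximation over a multigram sequence q_1 … q_K generally looks like (not necessarily the paper's exact formula):

```latex
p(q_1, \ldots, q_K) \approx \prod_{i=1}^{K} p\bigl(q_i \mid q_{i-N+1}, \ldots, q_{i-1}\bigr)
```

With N = 1 this reduces to a unigram model over multigrams, which is consistent with the observation above that higher-order models tended to learn noise from non-transliteration pairs.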
Tomeh, Nadi and Habash, Nizar and Roth, Ryan and Farra, Noura and Dasigi, Pradeep and Diab, Mona
Discriminative Reranking for OCR
Word LM features (“LM-word”) include the log probabilities of the hypothesis obtained using n-gram LMs with n ∈ {1, …
Discriminative Reranking for OCR
Semantic coherence feature (“SemCoh”) is motivated by the fact that semantic information can be very useful in modeling the fluency of phrases, and can augment the information provided by n-gram LMs.
Introduction
The BBN Byblos OCR system (Natajan et al., 2002; Prasad et al., 2008; Saleem et al., 2009), which we use in this paper, relies on a hidden Markov model (HMM) to recover the sequence of characters from the image, and uses an n-gram language model (LM) to emphasize the fluency of the output.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Evaluation
Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram , and primarily leverages higher quality generated candidates from the baseline system.
Evaluation
Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
Generation & Propagation
A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lin, Hui and Bilmes, Jeff
Submodularity in Summarization
By definition (Lin, 2004), ROUGE-N is the n-gram recall between a candidate summary and a set of reference summaries.
Submodularity in Summarization
Precisely, let S be the candidate summary (a set of sentences extracted from the ground set V), c_e : 2^V → Z_+ be the number of times n-gram e occurs in summary S, and R_i be the set of n-grams contained in the reference summary i (suppose we have K reference summaries, i.e., i = 1, …, K).
Submodularity in Summarization
where r_{e,i} is the number of times n-gram e occurs in reference summary i.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
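Putting the quantities above together, ROUGE-N recall of a candidate summary S against K reference summaries is the clipped n-gram coverage (a standard formulation written with the symbols from the excerpts):

```latex
\mathrm{ROUGE\text{-}N}(S) \;=\;
\frac{\sum_{i=1}^{K} \sum_{e \in R_i} \min\bigl(c_e(S),\, r_{e,i}\bigr)}
     {\sum_{i=1}^{K} \sum_{e \in R_i} r_{e,i}}
```

The denominator is constant in S, and since c_e(S) is additive over the sentences in S, the clipped numerator is monotone submodular in S, which is the property the paper exploits.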
Avramidis, Eleftherios and Koehn, Philipp
Experiments
The NIST metric clearly shows a significant improvement, because it mostly measures difficult n-gram matches (e. g. due to the long-distance rules we have been dealing with).
Experiments
In n-gram based metrics, the scores for all words are equally weighted, so mistakes on crucial sentence constituents may be penalized the same as errors on redundant or meaningless words (Callison-Burch et al., 2006).
Introduction
Thus, with respect to these methods, there is a problem when agreement needs to be applied on part of a sentence whose length exceeds the order of the target n-gram language model and the size of the chunks that are translated (see Figure 1 for an exam-
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bartlett, Susan and Kondrak, Grzegorz and Cherry, Colin
Related Work
Chen (2003) uses an n-gram model and Viterbi decoder as a syllabifier, and then applies it as a preprocessing step in his maximum-entropy-based English L2P system.
Syllabification with Structured SVMs
In addition to these primary n-gram features, we experimented with linguistically-derived features.
Syllabification with Structured SVMs
We believe that this is caused by the ability of the SVM to learn such generalizations from the n-gram features alone.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Deng, Yonggang and Xu, Jia and Gao, Yuqing
Features
Trying to find phrase translations for any possible n-gram is not a good idea for two reasons.
Features
We will define a confidence metric to estimate how reliably the model can align an n-gram in one side to a phrase on the other side given a parallel sentence.
Features
Now we turn to monolingual resources to evaluate the quality of an n-gram being a good phrase.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Titov, Ivan and McDonald, Ryan
The Model
In this model the distribution of the overall sentiment rating y_ov is based on all the n-gram features of a review text.
The Model
Then the distribution of y_a, for every rated aspect a, can be computed from the distribution of y_ov and from any n-gram feature where at least one word in the n-gram is assigned to the associated aspect topic (r = loc, z = a).
The Model
b_y is the bias term which regulates the prior distribution P(y_a = y), f iterates through all the n-grams, J_{y,f} and J_{a,y,f} are common weights and aspect-specific weights for n-gram feature f, and p_{f,a} is equal to the fraction of words in n-gram feature f assigned to the aspect topic (r = loc, z = a).
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Das, Dipanjan and Smith, Noah A.
Introduction
Although paraphrase identification is defined in semantic terms, it is usually solved using statistical classifiers based on shallow lexical, n-gram , and syntactic “overlap” features.
Introduction
We use a product of experts (Hinton, 2002) to bring together a logistic regression classifier built from n-gram overlap features and our syntactic model.
Product of Experts
The features are of the form precision_n (number of n-gram matches divided by the number of n-grams in S1), recall_n (number of n-gram matches divided by the number of n-grams in S2) and F_n (harmonic mean of the previous two features), where 1 ≤ n ≤ 3.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
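The overlap features quoted above (precision_n, recall_n and their harmonic mean for 1 ≤ n ≤ 3) are straightforward to compute; a small sketch of just that feature extractor, with whitespace tokenization as a simplifying assumption.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap_features(s1, s2, max_n=3):
    """precision_n, recall_n and harmonic mean F_n for 1 <= n <= max_n."""
    t1, t2 = s1.split(), s2.split()
    feats = {}
    for n in range(1, max_n + 1):
        c1, c2 = ngram_counts(t1, n), ngram_counts(t2, n)
        matches = sum((c1 & c2).values())            # multiset intersection
        prec = matches / max(sum(c1.values()), 1)    # matches / n-grams in S1
        rec = matches / max(sum(c2.values()), 1)     # matches / n-grams in S2
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        feats.update({f"precision_{n}": prec, f"recall_{n}": rec, f"F_{n}": f})
    return feats

print(overlap_features("the cat sat on the mat", "the cat lay on the mat"))
```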
Jung, Sangkeun and Lee, Cheongjae and Kim, Kyungduk and Lee, Gary Geunbae
Overall architecture
The unseen rate of n-gram varies according to the simulated user.
Overall architecture
Notice that simulated user C, E and H generates higher unseen n-gram patterns over all word error settings.
Related work
N-gram based approaches (Eckert et al., 1997, Levin et al., 2000) and other approaches (Scheffler and Young, 2001, Pietquin and Dutoit, 2006, Schatzmann et al., 2007) are introduced.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wu, Hua and Wang, Haifeng
Translation Selection
We use smoothed sentence-level BLEU score to replace the human assessments, where we use additive smoothing to avoid zero BLEU scores when we calculate the n-gram precisions.
Translation Selection
1-4 n-gram precisions against pseudo references (1 ≤ n ≤ 4)
Translation Selection
15-19 n-gram precision against a target corpus (1 ≤ n ≤ 5)
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bojar, Ondřej and Kos, Kamil and Mareċek, David
Introduction
Aside from including dependency and n-gram relations in the scoring, we also apply and evaluate SemPOS for English.
Problems of BLEU
Table 1 estimates the overall magnitude of this issue: For 1-grams to 4-grams in 1640 instances (different MT outputs and different annotators) of 200 sentences with manually flagged errors4, we count how often the n-gram is confirmed by the reference and how often it contains an error flag.
Problems of BLEU
Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Tratz, Stephen and Hovy, Eduard
Automated Classification
Web 1T N-gram Features
Automated Classification
Table 3 describes the extracted n-gram features.
Automated Classification
The influence of the Web 1T n-gram features was somewhat mixed.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Duan, Nan
MBR-based Answering Re-ranking
0 answer-level n-gram correlation feature:
MBR-based Answering Re-ranking
where w denotes an n-gram in A, #_w(A_k) denotes the number of times that w occurs in
MBR-based Answering Re-ranking
o passage-level n-gram correlation feature:
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Lin, Ziheng and Ng, Hwee Tou and Kan, Min-Yen
Conclusion
In our approach, n-gram sub-sequences of transitions per term in the discourse role matrix then constitute the more fine-grained evidence used in our model to distinguish coherence from incoherence.
Using Discourse Relations
tion of the n-gram discourse relation transition sequences in gold standard coherent text, and a similar one for incoherent text.
Using Discourse Relations
In our pilot work where we implemented such a basic model with n-gram features for relation transitions, the performance was very poor.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Rush, Alexander M. and Collins, Michael
Background: Hypergraphs
The second step is to integrate an n-gram language model with this hypergraph.
Introduction
The decoding problem for a broad range of these systems (e.g., (Chiang, 2005; Marcu et al., 2006; Shen et al., 2008)) corresponds to the intersection of a (weighted) hypergraph with an n-gram language model.1 The hypergraph represents a large set of possible translations, and is created by applying a synchronous grammar to the source language string.
Introduction
The language model is then used to score the translations in the hypergraph. Decoding with these models is challenging, largely because of the cost of integrating an n-gram language model into the search process.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Andreevskaia, Alina and Bergler, Sabine
Experiments
Due to lower frequency of higher-order n-grams (as opposed to unigrams), higher-order n-gram language models are more sparse, which increases the probability of missing a particular sentiment marker in a sentence (Table 33).
Experiments
training sets are required to overcome this higher n-gram sparseness in sentence-level annotation.
Experiments
results depends on the genre and size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01 but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zarriess, Sina and Cahill, Aoife and Kuhn, Jonas
Experimental Setup
Standard n-gram features are also used as features.4 The feature model is built as follows: for every lemma in the f-structure, we extract a set of morphological properties (definiteness, person, pronominal status etc.
Experimental Setup
use several standard measures: a) exact match: how often does the model select the original corpus sentence, b) BLEU: n-gram overlap between top-ranked and original sentence, c) NIST: modification of BLEU giving more weight to less frequent n-grams.
Related Work
The first widely known data-driven approach to surface realisation, or tactical generation (Langkilde and Knight, 1998), used language-model n-gram statistics on a word lattice of candidate realisations to guide a ranker.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bansal, Mohit and Klein, Dan
Abstract
To address semantic ambiguities in coreference resolution, we use Web n-gram features that capture a range of world knowledge in a diffuse but robust way.
Introduction
In order to harness the information on the Web without presupposing a deep understanding of all Web text, we instead turn to a diverse collection of Web n-gram counts (Brants and Franz, 2006) which, in aggregate, contain diffuse and indirect, but often robust, cues to reference.
Semantics via Web Features
These clusters come from distributional K -Means clustering (with K = 1000) on phrases, using the n-gram context as features.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Foster, George and Sankaran, Baskaran and Sarkar, Anoop
Conclusion & Future Work
In addition, we can extend our approach by applying some of the techniques used in other system combination approaches such as consensus decoding, using n-gram features, tuning using forest-based MERT, among other possible extensions.
Related Work 5.1 Domain Adaptation
In other words, it requires all component models to fully decode each sentence, compute n-gram expectations from each component model and calculate posterior probabilities over translation derivations.
Related Work 5.1 Domain Adaptation
Finally, main techniques used in this work are orthogonal to our approach such as Minimum Bayes Risk decoding, using n-gram features and tuning using MERT.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Zweig, Geoffrey and Platt, John C. and Meek, Christopher and Burges, Christopher J.C. and Yessenalina, Ainur and Liu, Qiang
Sentence Completion via Language Modeling
3.1 Backoff N-gram Language Model
Sentence Completion via Language Modeling
3.2 Maximum Entropy Class-Based N-gram Language Model
Sentence Completion via Latent Semantic Analysis
4.3 A LSA N-gram Language Model
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bergsma, Shane and Van Durme, Benjamin
Learning Class Attributes
We extract prevalent common nouns for males and females by selecting only those nouns that (a) occur more than 200 times in the dataset, (b) mostly occur with male or female pronouns, and (c) occur as lowercase more often than uppercase in a web-scale N-gram corpus (Lin et al., 2010).
Learning Class Attributes
We obtain the best of both worlds by matching our precise pattern against a version of the Google N-gram Corpus that includes the part-of-speech tag distributions for every N-gram (Lin et al., 2010).
Twitter Gender Prediction
We include n-gram features with the original capitalization pattern and separate features with the n-grams lower-cased.
n-gram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: