Index of papers in Proc. ACL 2014 that mention
  • n-gram
Devlin, Jacob and Zbib, Rabih and Huang, Zhongqiang and Lamar, Thomas and Schwartz, Richard and Makhoul, John
Decoding with the NNJM
Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder.
Decoding with the NNJM
When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space.
Decoding with the NNJM
We also train a separate lower-order n-gram model, which is necessary to compute estimate scores during hierarchical decoding.
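To make the state bookkeeping described in these excerpts concrete, here is a minimal sketch (ours; the function name and data layout are illustrative assumptions, not the paper's code) of the leftmost/rightmost n − 1 words a hierarchical decoder would store for a constituent, so that hypotheses can be recombined while an outside n-gram model can still be applied:

    def lm_state(target_words, n):
        """Return the (left, right) n-1 word boundaries used for recombination."""
        k = n - 1
        if len(target_words) <= k:
            # A short constituent exposes all of its words to outside context.
            return tuple(target_words), tuple(target_words)
        return tuple(target_words[:k]), tuple(target_words[-k:])

    # Two hypotheses with the same span and the same state can be recombined,
    # keeping only the higher-scoring one.
    print(lm_state("the red house on the hill".split(), n=4))
    # -> (('the', 'red', 'house'), ('on', 'the', 'hill'))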
Introduction
Initially, these models were primarily used to create n-gram neural network language models (NNLMs) for speech recognition and machine translation (Bengio et al., 2003; Schwenk, 2010).
Introduction
Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.
Model Variations
Specifically, this means that we don’t use dependency-based rule extraction, and our decoder only contains the following MT features: (1) rule probabilities, (2) n-gram Kneser-Ney LM, (3) lexical smoothing, (4) target word count, (5) concat rule penalty.
Model Variations
This does not include the cost of n-gram creation or cached lookups, which amount to ~0.03 seconds per source word in our current implementation. However, the n-grams created for the NNJM can be shared with the Kneser-Ney LM, which reduces the cost of that feature.
Model Variations
In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates.
Neural Network Joint Model (NNJM)
Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word ti is conditioned on the previous n − 1 target words.
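One way to write out the decomposition this excerpt describes, with a_i standing for the source position affiliated with target word t_i and m for the width of the source window (our notation for the source-window centering; the excerpt itself does not spell this out):

    P(T \mid S) \approx \prod_{i=1}^{|T|} P\big(t_i \mid t_{i-n+1}, \ldots, t_{i-1},\ s_{a_i - \lfloor m/2 \rfloor}, \ldots, s_{a_i + \lfloor m/2 \rfloor}\big)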
Neural Network Joint Model (NNJM)
If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score. Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for the self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.
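The per-lookup arithmetic in this excerpt can be sketched roughly as follows (ours; the toy vocabulary, random parameters, and variable names are assumptions, not the authors' implementation): each context word contributes a pre-computed 512-dimensional vector to the hidden pre-activation, so a lookup costs 512 tanh evaluations plus one 513-dimensional dot product, and self-normalization means no softmax denominator is needed.

    import numpy as np

    H = 512                                   # hidden units, as in the excerpt
    rng = np.random.default_rng(0)
    n_positions, vocab = 14, 200              # e.g. 3 target-history + 11 source-window words; toy vocabulary
    precomp = rng.normal(size=(n_positions, vocab, H))   # pre-computed hidden contributions per (position, word)
    hidden_bias = rng.normal(size=H)
    output_rows = rng.normal(size=(vocab, H + 1))        # per-word output weights plus bias

    def nnjm_log_prob(context_word_ids, target_word_id):
        pre = hidden_bias + sum(precomp[j, w] for j, w in enumerate(context_word_ids))
        h = np.tanh(pre)                      # 512 tanh evaluations
        row = output_rows[target_word_id]
        return float(h @ row[:H] + row[H])    # one 513-dimensional dot product; self-normalized, so this is ~log P

    print(nnjm_log_prob(rng.integers(0, vocab, size=n_positions), 42))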
Neural Network Joint Model (NNJM)
“lookups/sec” is the number of unique n-gram probabilities that can be computed per second.
n-gram is mentioned in 11 sentences in this paper.
Wintrode, Jonathan and Khudanpur, Sanjeev
Abstract
We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models.
Conclusions
Using word repetitions, we effectively use a broad document context outside of the typical 2-5 N-gram window.
Introduction
ASR systems traditionally use N-gram language models to incorporate prior knowledge of word occurrence patterns into prediction of the next word in the token stream.
Introduction
N-gram models cannot, however, capture complex linguistic or topical phenomena that occur outside the typical 3-5 word scope of the model.
Introduction
Confidence scores from an ASR system (which incorporate N-gram probabilities) are optimized in order to produce the most likely sequence of words rather than the accuracy of individual word detections.
Motivation
We seek a workable definition of broad document context beyond N-gram models that will improve term detection performance on an arbitrary set of queries.
Motivation
A number of efforts have been made to augment traditional N-gram models with latent topic information (Khudanpur and Wu, 1999; Florian and Yarowsky, 1999; Liu and Liu, 2008; Hsu and Glass, 2006; Naptali et al., 2012) including some of the early work on Probabilistic Latent Semantic Analysis by Hofmann (2001).
Term and Document Frequency Statistics
In applying the burstiness quantity to term detection, we recall that the task requires us to locate a particular instance of a term, not estimate a count, hence the utility of N-gram language models predicting words in sequence.
n-gram is mentioned in 9 sentences in this paper.
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Experiments
The NBoW performs similarly to the non-neural n-gram based classifiers.
Experiments
Besides the RecNN that uses an external parser to produce structural features for the model, the other models use n-gram based or neural features that do not require external resources or additional annotations.
Experiments
We see a significant increase in the performance of the DCNN with respect to the non-neural n-gram based classifiers; in the presence of large amounts of training data these classifiers constitute particularly strong baselines.
Introduction
Convolving the same filter with the n-gram at every position in the sentence allows the features to be extracted independently of their position in the sentence.
Properties of the Sentence Model
4.1 Word and n-Gram Order
Properties of the Sentence Model
For most applications and in order to learn fine-grained feature detectors, it is beneficial for a model to be able to discriminate whether a specific n-gram occurs in the input.
Properties of the Sentence Model
As discussed in Sect. 2.3, the Max-TDNN is sensitive to word order, but max pooling only picks out a single n-gram feature in each row of the sentence matrix.
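A small numpy sketch (ours, with toy dimensions) of the two operations these excerpts describe: the same filter is convolved with every n-gram position in the sentence matrix, and Max-TDNN-style pooling then keeps only a single value per row, which is why the extracted feature is insensitive to where in the sentence it occurred.

    import numpy as np

    rng = np.random.default_rng(0)
    d, sent_len, m = 4, 7, 3                 # embedding dim, sentence length, filter width
    S = rng.normal(size=(d, sent_len))       # sentence matrix: one column per word
    F = rng.normal(size=(d, m))              # one convolutional filter row per embedding dimension

    # Narrow convolution: apply the same filter weights at every m-gram position.
    conv = np.stack([(S[:, i:i + m] * F).sum(axis=1)
                     for i in range(sent_len - m + 1)], axis=1)

    pooled = conv.max(axis=1)                # max pooling: a single feature per row
    print(conv.shape, pooled.shape)          # (4, 5) (4,)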
n-gram is mentioned in 8 sentences in this paper.
Ma, Ji and Zhang, Yue and Zhu, Jingbo
Experiments
For each data set, we investigate an extensive set of combinations of hyper-parameters: the n-gram window (l,r) in {(1, 1), (2,1), (1,2), (2,2)}; the hidden layer size in {200, 300, 400}; the learning rate in {0.1, 0.01, 0.001}.
Experiments
5.3.2 Word and N-gram Representation
Experiments
By contrast, using n-gram representations improves the performance on both oov and non-oov.
Learning from Web Text
The basic idea is to share word representations across different positions in the input n-gram while using position-dependent weights to distinguish between different word orders.
Learning from Web Text
Let v^(j) denote the j-th visible variable of the WRRBM, which is a vector of length |V| (the vocabulary size). Then v^(j) = w_k means that the j-th word in the n-gram is w_k.
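A minimal sketch (ours; sizes and names are assumptions) of the weight sharing these excerpts describe: every position j of the n-gram looks its word up in the same representation table, but projects it through a position-dependent weight matrix, so word identity is shared across positions while word order is still distinguished.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, H, n = 5000, 50, 100, 3            # vocab size, embedding dim, hidden units, n-gram order
    D = rng.normal(size=(V, d))              # shared word representations
    W = rng.normal(size=(n, H, d))           # one projection matrix per n-gram position
    b = np.zeros(H)

    def hidden_activation(word_ids):
        """Probability of each hidden unit being on, given an n-gram of word ids."""
        pre = b + sum(W[j] @ D[w] for j, w in enumerate(word_ids))
        return 1.0 / (1.0 + np.exp(-pre))    # logistic hidden units, as in an RBM

    print(hidden_activation([17, 4096, 333])[:5])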
Neural Network for POS Disambiguation
The input for this module is the word n-gram (w_{i−l}, …, w_i, …, w_{i+r}).
Neural Network for POS Disambiguation
representations of the input n-gram.
Related Work
While those approaches mainly explore token-level representations (word or character embeddings), the WRRBM is able to utilize both word and n-gram representations.
n-gram is mentioned in 8 sentences in this paper.
Thadani, Kapil
Abstract
Sentence compression has been shown to benefit from joint inference involving both n-gram and dependency-factored objectives but this typically requires expensive integer programming.
Experiments
ILP-Dep: A version of the joint ILP of Thadani and McKeown (2013) without n-gram variables and corresponding features.
Experiments
Starting with the n-gram approaches, the performance of 3-LM leads us to observe that the gains of supervised learning far outweigh the utility of higher-order n-gram factorization, which is also responsible for a significant increase in wall-clock time.
Experiments
We were surprised by the strong performance of the dependency-based inference techniques, which yielded results that approached the joint model in both n-gram and parse-based measures.
Introduction
Following an assumption often used in compression systems, the compressed output in this corpus is constructed by dropping tokens from the input sentence without any paraphrasing or reordering. A number of diverse approaches have been proposed for deletion-based sentence compression, including techniques that assemble the output text under an n-gram factorization over the input text (McDonald, 2006; Clarke and Lapata, 2008) or an arc factorization over input dependency parses (Filippova and Strube, 2008; Galanis and Androutsopoulos, 2010; Filippova and Altun, 2013).
Introduction
In this work, we develop approximate inference strategies to the joint approach of Thadani and McKeown (2013) which trade the optimality guarantees of exact ILP for faster inference by separately solving the n-gram and dependency subproblems and using Lagrange multipliers to enforce consistency between their solutions.
Multi-Structure Sentence Compression
Let α(y) ∈ {0, 1}^n denote the incidence vector of tokens contained in the n-gram sequence y and β(z) ∈ {0, 1}^n denote the incidence vector of words contained in the dependency tree z.
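The multiplier-based consistency described in these excerpts follows the generic dual-decomposition recipe; the sketch below (ours, not the paper's implementation) assumes placeholder subproblem solvers that return a solution together with its token incidence vector, and updates one multiplier per input token by subgradient steps until α(y) and β(z) agree.

    import numpy as np

    def lagrangian_compression(n_tokens, solve_ngram_subproblem, solve_dependency_subproblem,
                               iterations=100, step=1.0):
        u = np.zeros(n_tokens)                       # one multiplier per input token
        for t in range(iterations):
            y, alpha = solve_ngram_subproblem(+u)    # n-gram subproblem scores each token with +u
            z, beta = solve_dependency_subproblem(-u)  # dependency subproblem scores each token with -u
            if np.array_equal(alpha, beta):          # certificate: both structures keep the same tokens
                return y, z
            u -= (step / (t + 1)) * (alpha - beta)   # subgradient update on the multipliers
        return y, z                                  # fall back to the last (possibly inconsistent) pair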
n-gram is mentioned in 7 sentences in this paper.
Zhang, Hui and Chiang, David
Smoothing on count distributions
On integral counts, this is simple: we generate, for each n-gram type vu’w, an (n − 1)-gram token u’w, for a total of n_{1+}(•u’w) tokens.
Smoothing on count distributions
Analogously, on count distributions, for each n-gram type vu’w, we generate an (n − 1)-gram token u’w with probability p(c(vu’w) > 0).
Smoothing on count distributions
Using the dynamic program in Section 3.2, computing the distributions for each r is linear in the number of n-gram types, and we only need to compute the distributions up to r = 2 (or r = 4 for modified KN), and store them for r = 0 (or up to r = 2 for modified KN).
Smoothing on integral counts
Let uw stand for an n-gram, where u stands for the (n − 1) context words and w, the predicted word.
Smoothing on integral counts
where n_{1+}(u•) = |{w | c(uw) > 0}| is the number of word types observed after context u, and q_u(w) specifies how to distribute the subtracted discounts among unseen n-gram types.
Smoothing on integral counts
The probability of an n-gram token uw using the other tokens as training data is
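The display equation this excerpt leads into is not captured by the index; as a reminder of the integral-count quantities the excerpts define (c(uw), c(u), and n_{1+}(u•)), here is a generic interpolated absolute-discounting sketch (ours, not the paper's count-distribution generalization):

    from collections import Counter

    counts = Counter()                          # c(uw) for observed n-grams
    def observe(u, w):
        counts[(u, w)] += 1

    def prob(u, w, lower_order_prob, D=0.75):
        c_uw = counts[(u, w)]
        c_u = sum(c for (ctx, _), c in counts.items() if ctx == u)
        n1plus = sum(1 for (ctx, _), c in counts.items() if ctx == u and c > 0)
        if c_u == 0:
            return lower_order_prob(w)
        discounted = max(c_uw - D, 0.0) / c_u
        backoff_mass = D * n1plus / c_u         # mass redistributed via q_u(w)
        return discounted + backoff_mass * lower_order_prob(w)

    for bigram in [("the", "cat"), ("the", "dog"), ("the", "cat")]:
        observe(*bigram)
    print(prob("the", "cat", lower_order_prob=lambda w: 1.0 / 10))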
n-gram is mentioned in 7 sentences in this paper.
Chen, Yanping and Zheng, Qinghua and Zhang, Wei
Feature Construction
The Omni-word feature can be seen as a subset of the n-gram feature.
Feature Construction
It is not the same as the n-Gram feature.
Feature Construction
N-Gram features are more fragmented.
n-gram is mentioned in 5 sentences in this paper.
Auli, Michael and Gao, Jianfeng
Decoder Integration
One exception is the n-gram language model, which requires the preceding n − 1 words as well.
Decoder Integration
To solve this problem, we follow previous work on lattice rescoring with recurrent networks that maintained the usual n-gram context but kept a beam of hidden layer configurations at each state (Auli et al., 2013).
Decoder Integration
This approximation has been effective for lattice rescoring, since the translations represented by each state are in fact very similar: They share both the same source words as well as the same n-gram context which is likely to result in similar recurrent histories that can be safely pruned.
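Illustrative bookkeeping (ours) for the approximation these excerpts describe: lattice states are still keyed by the usual n-gram context, but each state keeps a small beam of recurrent hidden layers rather than a single one; score_word and advance_hidden below stand in for the recurrent model and are assumptions.

    import heapq

    def rescore_state(incoming, word, beam_size, score_word, advance_hidden):
        """incoming: list of (score, hidden) pairs reaching this n-gram-keyed state."""
        expanded = [(score + score_word(hidden, word), advance_hidden(hidden, word))
                    for score, hidden in incoming]
        # Keep only the best few hidden-layer configurations for this context.
        return heapq.nlargest(beam_size, expanded, key=lambda pair: pair[0])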
Introduction
Decoding with feed-forward architectures is straightforward, since predictions are based on a fixed size input, similar to n-gram language models.
n-gram is mentioned in 4 sentences in this paper.
Heilman, Michael and Cahill, Aoife and Madnani, Nitin and Lopez, Melissa and Mulholland, Matthew and Tetreault, Joel
Abstract
In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
Discussion and Conclusions
While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
Experiments
n-gram frequencies from Gigaword and whether the link parser can fully parse the sentence.
System Description
3.2.2 n-gram Count and Language Model Features
n-gram is mentioned in 4 sentences in this paper.
Labutov, Igor and Lipson, Hod
Model
trailing word in the n-gram).
Model
Intuitively, an update to the parameter associated with word w_i occurs after the learner observes w_i in a context (this may be an n-gram, an entire sentence or paragraph containing w_i, but we will restrict our attention to fixed-length n-grams).
Model
For this task, we focus only on the following context features for predicting the “predictability” of words: n-gram probability, vector-space similarity score, coreferring mentions.
n-gram is mentioned in 4 sentences in this paper.
Bansal, Mohit and Burkett, David and de Melo, Gerard and Klein, Dan
Experiments
Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text.
Features
3.2 Semantic Features. 3.2.1 Web n-gram Features. Patterns and counts: Hypernymy for a term pair
Features
Hence, we fire Web n-gram pattern features and Wikipedia presence, distance, and pattern features, similar to those described above, on each potential sibling term pair. The main difference here from the edge factors is that the sibling factors are symmetric (the factor for one ordering of a term pair is redundant to the factor for the reverse ordering) and hence the patterns are undirected.
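A toy sketch (ours; the pattern list and counts table are illustrative, not the paper's feature set) of firing a Web n-gram pattern feature for a term pair by looking up Hearst-style patterns in a table of n-gram counts standing in for the Google n-grams corpus:

    PATTERNS = ["{y} such as {x}", "{x} is a {y}", "{y} including {x}"]

    def pattern_counts(x, y, ngram_counts):
        # Instantiate each pattern for the term pair and look up its corpus count.
        return {p: ngram_counts.get(p.format(x=x, y=y), 0) for p in PATTERNS}

    toy_counts = {"animals such as dogs": 1200}
    print(pattern_counts("dogs", "animals", toy_counts))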
n-gram is mentioned in 3 sentences in this paper.
Plank, Barbara and Hovy, Dirk and Sogaard, Anders
Hard cases and annotation errors
the longest sequence of words (n-gram) in a corpus that has been observed with a token being tagged differently in another occurrence of the same n-gram in the same corpus.
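A short sketch (ours) of the variation-n-gram idea defined in this excerpt: collect every occurrence of each word n-gram in a tagged corpus and report those whose aligned tokens received more than one tag across occurrences.

    from collections import defaultdict

    def variation_ngrams(tagged_sentences, n=3):
        occurrences = defaultdict(list)                  # word n-gram -> list of tag tuples
        for sent in tagged_sentences:                    # sent: list of (word, tag) pairs
            for i in range(len(sent) - n + 1):
                window = sent[i:i + n]
                words = tuple(w for w, _ in window)
                tags = tuple(t for _, t in window)
                occurrences[words].append(tags)
        return {ng: tag_sets for ng, tag_sets in occurrences.items()
                if len(set(tag_sets)) > 1}               # same words, different taggings

    corpus = [[("a", "DET"), ("little", "ADJ"), ("while", "NOUN")],
              [("a", "DET"), ("little", "ADV"), ("while", "NOUN")]]
    print(variation_ngrams(corpus, n=3))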
Hard cases and annotation errors
For each variation n-gram that we found in WSJ-00, i.e., a word in various contexts and the possible tags associated with it, we present annotators with the cross product of contexts and tags.
Hard cases and annotation errors
The figure shows, for instance, that the variation n-gram regarding ADP-ADV is the second most frequent one (dark gray), and approximately 70% of ADP-ADV disagreements are linguistically hard cases (light gray).
n-gram is mentioned in 3 sentences in this paper.
Salameh, Mohammad and Cherry, Colin and Kondrak, Grzegorz
Methods
In order to annotate lattice edges with an n-gram LM, every path coming into a node must end with the same sequence of (n − 1) tokens.
Methods
Programmatic composition of a lattice with an n-gram LM acceptor is a well understood problem.
Methods
With each node corresponding to a single LM context, annotation of outgoing edges with n-gram LM scores is straightforward.
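A compact sketch (ours; the lattice encoding is an assumption) of the invariant these excerpts describe: each lattice node is split by the last n − 1 tokens of the paths reaching it, so every resulting node corresponds to a single LM context and each outgoing edge can be annotated with an n-gram score.

    from collections import deque

    def expand_by_context(lattice, start, n):
        """lattice: dict node -> list of (word, next_node). Returns edges of the
        expanded graph as ((node, context), word, (next_node, next_context))."""
        k = n - 1
        edges, seen = [], set()
        queue = deque([(start, ())])                 # empty history at the start node
        while queue:
            node, ctx = queue.popleft()
            if (node, ctx) in seen:
                continue
            seen.add((node, ctx))
            for word, nxt in lattice.get(node, []):
                nxt_ctx = (ctx + (word,))[-k:]       # keep only the last n-1 tokens
                edges.append(((node, ctx), word, (nxt, nxt_ctx)))
                queue.append((nxt, nxt_ctx))
        return edges

    toy = {"q0": [("the", "q1"), ("a", "q1")], "q1": [("cat", "q2")]}
    print(expand_by_context(toy, "q0", n=2))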
n-gram is mentioned in 3 sentences in this paper.
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Evaluation
Further examination of the differences between the two systems yielded that most of the improvements are due to better bigrams and trigrams, as indicated by the breakdown of the BLEU score precision per n-gram , and primarily leverages higher quality generated candidates from the baseline system.
Evaluation
Furthermore, despite completely unaligned, non-comparable monolingual text on the Urdu and English sides, and a very large language model, we can still achieve gains in excess of 1.2 BLEU points (“SLP”) in a difficult evaluation scenario, which shows that the technique adds a genuine translation improvement over and above naïve memorization of n-gram sequences.
Generation & Propagation
A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
n-gram is mentioned in 3 sentences in this paper.
Xiao, Tong and Zhu, Jingbo and Zhang, Chunliang
A Skeleton-based Approach to MT 2.1 Skeleton Identification
For language modeling, lm is the standard n-gram language model adopted in the baseline system.
A Skeleton-based Approach to MT 2.1 Skeleton Identification
In such a way of string representation, the skeletal language model can be implemented as a standard n-gram language model, that is, a string probability is calculated by a product of a sequence of n-gram probabilities (involving normal words and X).
A Skeleton-based Approach to MT 2.1 Skeleton Identification
The skeletal language model is then trained on these generalized strings in a standard way of n-gram language modeling.
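A toy sketch (ours; score_ngram stands in for a trained n-gram LM and is an assumption) of the generalized strings these excerpts describe: words outside the skeleton are replaced by the symbol X, and the resulting string is scored as an ordinary product of n-gram probabilities.

    import math

    def skeletonize(words, skeleton_positions):
        # Replace every non-skeleton word with the generalized symbol X.
        return [w if i in skeleton_positions else "X" for i, w in enumerate(words)]

    def log_prob(words, score_ngram, n=3):
        padded = ["<s>"] * (n - 1) + words + ["</s>"]
        return sum(math.log(score_ngram(tuple(padded[i - n + 1:i]), padded[i]))
                   for i in range(n - 1, len(padded)))

    sent = "he bought a book about machine translation".split()
    print(skeletonize(sent, skeleton_positions={0, 1, 3}))
    # ['he', 'bought', 'X', 'book', 'X', 'X', 'X']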
n-gram is mentioned in 3 sentences in this paper.