Index of papers in Proc. ACL that mention
  • bigram
Li, Chen and Qian, Xian and Liu, Yang
Abstract
In this paper, we propose a bigram based supervised method for extractive document summarization in the integer linear programming (ILP) framework.
Abstract
For each bigram , a regression model is used to estimate its frequency in the reference summary.
Abstract
The regression model uses a variety of indicative features and is trained discriminatively to minimize the distance between the estimated and the ground truth bigram frequency in the reference summary.
Introduction
They used bigrams as such language concepts.
Introduction
Gillick and Favre (Gillick and Favre, 2009) used bigrams as concepts, which are selected from a subset of the sentences, and their document frequency as the weight in the objective function.
Introduction
In this paper, we propose to find a candidate summary such that the language concepts (e. g., bigrams ) in this candidate summary and the reference summary can have the same frequency.
Proposed Method 2.1 Bigram Gain Maximization by ILP
We choose bigrams as the language concepts in our proposed method since they have been successfully used in previous work.
Proposed Method 2.1 Bigram Gain Maximization by ILP
In addition, we expect that the bigram oriented ILP is consistent with the ROUGE-2 measure widely used for summarization evaluation.
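The ILP objective itself is the authors'; as a rough, hedged illustration of one ingredient, the ground-truth bigram frequencies in reference summaries that their regression model is trained to predict, the following sketch simply counts them (whitespace tokenization and the data layout are assumptions, not their preprocessing):

    from collections import Counter

    def bigrams(tokens):
        """Return the list of adjacent word pairs in a token sequence."""
        return list(zip(tokens, tokens[1:]))

    def reference_bigram_frequencies(reference_summaries):
        """Count how often each bigram appears across the reference summaries.
        These counts are the kind of ground-truth frequencies a regression
        model could be trained to predict from document-side features."""
        freq = Counter()
        for summary in reference_summaries:
            freq.update(bigrams(summary.lower().split()))
        return freq

    # Toy usage: two tiny "reference summaries".
    refs = ["the proposed method improves summarization",
            "the proposed bigram method works well"]
    print(reference_bigram_frequencies(refs).most_common(3))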
bigram is mentioned in 114 sentences in this paper.
Topics mentioned in this paper:
Chong, Tze Yuang and E. Banchs, Rafael and Chng, Eng Siong and Li, Haizhou
Abstract
Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced up to 23.5% and 14.0%, respectively.
Abstract
Compared to the distant bigram , we show that word-pairs can be more effectively modeled in terms of both distance and occurrence.
Language Modeling with TD and TO
The prior, which is usually implemented as a unigram model, can be also replaced with a higher order n-gram model as, for instance, the bigram model:
Perplexity Evaluation
As seen from the table, for lower order n- gram models, the complementary information captured by the TD and TO components reduced the perplexity up to 23.5% and 14.0%, for bigram and trigram models, respectively.
Perplexity Evaluation
ter modeling of word-pairs compared to the distant bigram model.
Perplexity Evaluation
Here we compare the perplexity of both, the distance-k bigram model and distance-k TD model (for values of k ranging from two to ten), when combined with a standard bigram model.
Related Work
The distant bigram model (Huang et al. 1993, Simon et al.
Related Work
2007) disassembles the n-gram into (n − 1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n − 1.
Related Work
Each distance-k bigram model predicts the target-word based on the occurrence of a history-word located k positions behind.
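A minimal sketch of a distance-k bigram estimate of the kind described above, using plain relative frequencies (the cited work adds smoothing and combines information across distances; the toy corpus is an assumption):

    from collections import Counter, defaultdict

    def distance_k_bigram_model(tokens, k):
        """Estimate P_k(w_t | w_{t-k}) by relative frequency:
        count each (history word k positions back, target word) pair,
        then normalize per history word."""
        pair_counts = Counter(zip(tokens[:-k], tokens[k:]))
        history_counts = Counter(tokens[:-k])
        model = defaultdict(dict)
        for (h, w), c in pair_counts.items():
            model[h][w] = c / history_counts[h]
        return model

    tokens = "the cat sat on the mat and the dog sat".split()
    model = distance_k_bigram_model(tokens, k=2)
    print(model["the"])   # distribution over words two positions after "the"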
bigram is mentioned in 17 sentences in this paper.
Topics mentioned in this paper:
Börschinger, Benjamin and Johnson, Mark and Demuth, Katherine
Abstract
We find that Bigram dependencies are important for performing well on real data and for learning appropriate deletion probabilities for different contexts.1
Experiments 4.1 The data
Best performance for both the Unigram and the Bigram model in the GOLD-p condition is achieved under the left-right setting, in line with the standard analyses of /t/-deletion as primarily being determined by the preceding and the following context.
Experiments 4.1 The data
For the LEARN-p condition, the Bigram model still performs best in the left-right setting but the Unigram model’s performance drops
Experiments 4.1 The data
Note how the Unigram model always suffers in the LEARN-p condition whereas the Bigram model’s performance is actually best for LEARN-p in the left-right setting.
Introduction
We find that models that capture bigram dependencies between underlying forms provide considerably more accurate estimates of those probabilities than corresponding unigram or “bag of words” models of underlying forms.
The computational model
Our models build on the Unigram and the Bigram model introduced in Goldwater et al.
The computational model
Figure 1 shows the graphical model for our joint Bigram model (the Unigram case is trivially recovered by generating the U_{i,j}s directly from L rather than from L_{U_{i,j-1}}).
The computational model
Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation.
bigram is mentioned in 31 sentences in this paper.
Topics mentioned in this paper:
Bhat, Suma and Xue, Huichao and Yoon, Su-Youn
Models for Measuring Grammatical Competence
Then, regarding POS bigrams as terms, they construct POS-based vector space models for each score-class (there are four score classes denoting levels of proficiency as will be explained in Section 5.2), thus yielding four score-specific vector-space models (VSMs).
Models for Measuring Grammatical Competence
• cos4: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and,
Models for Measuring Grammatical Competence
First, the VSM-based method is likely to overestimate the contribution of the POS bigrams when highly correlated bigrams occur as terms in the VSM.
Related Work
In order to avoid the problems encountered with deep analysis-based measures, Yoon and Bhat (2012) explored a shallow analysis-based approach, based on the assumption that the level of grammar sophistication at each proficiency level is reflected in the distribution of part-of-speech (POS) tag bigrams .
Shallow-analysis approach to measuring syntactic complexity
The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence.
Shallow-analysis approach to measuring syntactic complexity
In a shallow-analysis approach to measuring syntactic complexity, we rely on the distribution of POS bigrams at every proficiency level
Shallow-analysis approach to measuring syntactic complexity
Consider the two sentence fragments below taken from actual responses (the bigrams of interest and their associated POS tags are boldfaced).
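As a hedged illustration of the shallow-analysis idea, comparing a response's POS-bigram vector to a score-class vector by cosine similarity, the sketch below uses toy POS sequences; it is not the authors' scoring system:

    import math
    from collections import Counter

    def pos_bigram_vector(pos_tags):
        """Represent a response as a bag of adjacent POS-tag pairs."""
        return Counter(zip(pos_tags, pos_tags[1:]))

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
        norm = math.sqrt(sum(x * x for x in u.values())) * \
               math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Toy data: POS sequences for a test response and for a (tiny) score class.
    response = ["DT", "JJ", "NN", "VBZ", "JJ"]
    level4_model = pos_bigram_vector(["DT", "JJ", "NN", "VBZ", "RB", "JJ"])
    print(cosine(pos_bigram_vector(response), level4_model))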
bigram is mentioned in 22 sentences in this paper.
Topics mentioned in this paper:
Bollegala, Danushka and Weir, David and Carroll, John
Distribution Prediction
For this purpose, we represent a word w using unigrams and bigrams that co-occur with w in a sentence as follows.
Distribution Prediction
Next, we generate bigrams of word lemmas and remove any bigrams that consist only of stop words.
Distribution Prediction
Bigram features capture negations more accurately than unigrams, and have been found to be useful for sentiment classification tasks.
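A small sketch of the kind of bigram feature generation described above, with a toy stop-word list standing in for the real one (lemmatization is omitted); note how a negation bigram such as not+good survives the filter:

    STOP_WORDS = {"the", "a", "an", "of", "is", "not"}  # toy stop list (assumption)

    def bigram_features(lemmas):
        """Generate bigram features, dropping bigrams made up only of stop words.
        Keeping e.g. ("not", "good") preserves negation information."""
        feats = []
        for w1, w2 in zip(lemmas, lemmas[1:]):
            if w1 in STOP_WORDS and w2 in STOP_WORDS:
                continue
            feats.append(w1 + "+" + w2)
        return feats

    print(bigram_features("this movie is not good".split()))
    # ['this+movie', 'movie+is', 'not+good']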
bigram is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Schütze, Hinrich
Experimental Setup
256,873 unique unigrams and 4,494,222 unique bigrams .
Experimental Setup
We cluster unigrams (i = 1) and bigrams (i = 2).
Experimental Setup
SRILM does not directly support bigram clustering.
Models
The parameters d’, d”, and d’” are the discounts for unigrams, bigrams and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)).
Models
bigram ) histories that is covered by the clusters.
Models
We cluster bigram histories and unigram histories separately and write p_B(w3|w1w2) for the bigram cluster model and p_B(w3|w2) for the unigram cluster model.
bigram is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Pei, Wenzhe and Ge, Tao and Chang, Baobao
Experiment
Model                        PKU    MSRA
Best05 (Chen et al., 2005)   95.0   96.0
Best05 (Tseng et al., 2005)  95.0   96.4
(Zhang et al., 2006)         95.1   97.1
(Zhang and Clark, 2007)      94.5   97.2
(Sun et al., 2009)           95.2   97.3
(Sun et al., 2012)           95.4   97.4
(Zhang et al., 2013)         96.1   97.4
MMTNN                        94.0   94.9
MMTNN + bigram               95.2   97.2
Experiment
A very common feature in Chinese word segmentation is the character bigram feature.
Experiment
Formally, at the i-th character of a sentence c[1:n], the bigram features are c_k c_{k+1} (i − 3 < k < i + 2).
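A hedged sketch of these character-bigram templates, with '#' padding at sentence boundaries as an assumption:

    def char_bigram_features(chars, i, pad="#"):
        """Character bigram features c_k c_{k+1} for i-3 < k < i+2,
        i.e. k in {i-2, i-1, i, i+1}, around position i of the sentence."""
        feats = []
        for k in range(i - 2, i + 2):
            c1 = chars[k] if 0 <= k < len(chars) else pad
            c2 = chars[k + 1] if 0 <= k + 1 < len(chars) else pad
            feats.append(c1 + c2)
        return feats

    sentence = list("ABCDE")  # stand-in for a Chinese character sequence
    print(char_bigram_features(sentence, i=2))  # ['AB', 'BC', 'CD', 'DE']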
Introduction
Therefore, we integrate additional simple character bigram features into our model and the result shows that our model can achieve a competitive performance that other systems hardly achieve unless they use more complex task-specific features.
Related Work
Most previous systems address this task by using linear statistical models with carefully designed features such as bigram features, punctuation information (Li and Sun, 2009) and statistical information (Sun and Xu, 2011).
bigram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Bollegala, Danushka and Weir, David and Carroll, John
A Motivating Example
( bigrams ) survey+development, development+civilization
Feature Expansion
..., w_N}, where the elements w_i are either unigrams or bigrams that appear in the review d. We then represent a review d by a real-valued term-frequency vector d ∈ R^N, where the value of the j-th element d_j is set to the total number of occurrences of the unigram or bigram w_j in the review d. To find the suitable candidates to expand a vector d for the review d, we define a ranking score score(u_i, d) for each base entry in the thesaurus as follows:
Feature Expansion
Moreover, we weight the relatedness scores for each word wj by its normalized term-frequency to emphasize the salient unigrams and bigrams in a review.
Feature Expansion
This is particularly important because we would like to score base entries u_i considering all the unigrams and bigrams that appear in a review d, instead of considering each unigram or bigram individually.
Introduction
a unigram or a bigram of word lemma) in a review using a feature vector.
Sentiment Sensitive Thesaurus
We select unigrams and bigrams from each sentence.
Sentiment Sensitive Thesaurus
For the remainder of this paper, we will refer to unigrams and bigrams collectively as lexical elements.
Sentiment Sensitive Thesaurus
Previous work on sentiment classification has shown that both unigrams and bigrams are useful for training a sentiment classifier (Blitzer et al., 2007).
bigram is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Blunsom, Phil and Cohn, Trevor
Background
This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams and unigrams and whole words and their character level representation.
Experiments
mkcls (Och, 1999)              73.7   65.6
MLE 1HMM-LM (Clark, 2003)*     71.2   65.5
BHMM (GG07)                    63.2   56.2
PR (Ganchev et al., 2010)*     62.5   54.8
Trigram PYP-HMM                69.8   62.6
Trigram PYP-1HMM               76.0   68.0
Trigram PYP-1HMM-LM            77.5   69.7
Bigram PYP-HMM                 66.9   59.2
Bigram PYP-1HMM                72.9   65.9
Trigram DP-HMM                 68.1   60.0
Trigram DP-1HMM                76.0   68.0
Trigram DP-1HMM-LM             76.8   69.8
Experiments
If we restrict the model to bigrams we see a considerable drop in performance.
The PYP-HMM
The trigram transition distribution, T_{ij}, is drawn from a hierarchical PYP prior which backs off to a bigram B_j and then a unigram U distribution,
The PYP-HMM
This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions.
The PYP-HMM
We formulate the character-level language model as a bigram model over the character sequence comprising word w_l,
bigram is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Berg-Kirkpatrick, Taylor and Gillick, Dan and Klein, Dan
Joint Model
While past extractive methods have assigned value to individual sentences and then explicitly represented the notion of redundancy (Carbonell and Goldstein, 1998), recent methods show greater success by using a simpler notion of coverage: bigrams
Joint Model
Note that there is intentionally a bigram missing from (a).
Joint Model
contribute content, and redundancy is implicitly encoded in the fact that redundant sentences cover fewer bigrams (Nenkova and Vanderwende, 2005; Gillick and Favre, 2009).
Structured Learning
We use bigram recall as our loss function (see Section 3.3).
Structured Learning
Luckily, our choice of loss function, bigram recall, factors over bigrams .
Structured Learning
We simply modify each bigram value v_b to include bigram b’s contribution to the total loss.
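A minimal sketch of bigram recall and of why it factors over bigrams: each reference bigram contributes independently to the score, so its share of the loss can be folded into that bigram's value. The data structures here are assumptions, not the authors' system:

    def bigram_recall(summary_bigrams, reference_bigrams):
        """Fraction of distinct reference bigrams covered by the summary.
        Because the score is a sum over reference bigrams, the loss
        (1 - recall) factors over bigrams as well."""
        ref = set(reference_bigrams)
        if not ref:
            return 0.0
        return len(ref & set(summary_bigrams)) / len(ref)

    ref_bigrams = [("budget", "cuts"), ("cuts", "approved"), ("new", "budget")]
    sys_bigrams = [("budget", "cuts"), ("cuts", "delayed")]
    print(bigram_recall(sys_bigrams, ref_bigrams))  # 1/3 of reference bigrams covered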
bigram is mentioned in 36 sentences in this paper.
Topics mentioned in this paper:
Nuhn, Malte and Ney, Hermann
Abstract
In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard.
Definitions
similarly define the bigram count N_{ff′} of f, f′ ∈ V_f as
Definitions
(a) N_{ff′} are integer counts > 0 of bigrams found in the ciphertext f_1^N.
Definitions
(b) Given the first and last token of the cipher f_1 and f_N, the bigram counts involving the sentence boundary token $ need to fulfill
Introduction
Section 5 shows the connection between the quadratic assignment problem and decipherment using a bigram language model.
bigram is mentioned in 14 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Baldridge, Jason and Knight, Kevin
Introduction
Most methods have employed some variant of Expectation Maximization (EM) to learn parameters for a bigram
Introduction
Ravi and Knight (2009) achieved the best results thus far (92.3% word token accuracy) via a Minimum Description Length approach using an integer program (IP) that finds a minimal bigram grammar that obeys the tag dictionary constraints and covers the observed data.
Minimized models for supertagging
The 1241 distinct supertags in the tagset result in 1.5 million tag bigram entries in the model and the dictionary contains almost 3.5 million word/tag pairs that are relevant to the test data.
Minimized models for supertagging
The set of 45 POS tags for the same data yields 2025 tag bigrams and 8910 dictionary entries.
Minimized models for supertagging
Our objective is to find the smallest supertag grammar (of tag bigram types) that explains the entire text while obeying the lexicon’s constraints.
bigram is mentioned in 26 sentences in this paper.
Topics mentioned in this paper:
Dickinson, Markus
Ad hoc rule detection
3.4 Bigram anomalies 3.4.1 Motivation
Ad hoc rule detection
The bigram method examines relationships between adjacent sisters, complementing the whole rule method by focusing on local properties.
Ad hoc rule detection
But only the final elements have anomalous bigrams : HD:ID IR:IR, IR:IR AN:RO, and AN:RO JR:IR all never occur.
Additional information
This rule is entirely correct, yet the XX:XX position has low whole rule and bigram scores.
Approach
First, the bigram method abstracts a rule to its bigrams .
Evaluation
For example, the bigram method with a threshold of 39 leads to finding 283 errors (455 x .622).
Evaluation
The whole rule and bigram methods reveal greater precision in identifying problematic dependencies, isolating elements with lower UAS and LAS scores than with frequency, along with corresponding greater pre-
Introduction and Motivation
We propose to flag erroneous parse rules, using information which reflects different grammatical properties: POS lookup, bigram information, and full rule comparisons.
bigram is mentioned in 19 sentences in this paper.
Topics mentioned in this paper:
Ravi, Sujith and Knight, Kevin
Fitting the Model
We still give EM the full word/tag dictionary, but now we constrain its initial grammar model to the 459 tag bigrams identified by IP.
Fitting the Model
In addition to removing many bad tag bigrams from the grammar, IP minimization also removes some of the good ones, leading to lower recall (EM = 0.87, IP+EM = 0.57).
Fitting the Model
During EM training, the smaller grammar with fewer bad tag bigrams helps to restrict the dictionary model from making too many bad choices that EM made earlier.
Small Models
That is, there exists a tag sequence that contains 459 distinct tag bigrams , and no other tag sequence contains fewer.
Small Models
We also create variables for every possible tag bigram and word/tag dictionary entry.
What goes wrong with EM?
We investigate the Viterbi tag sequence generated by EM training and count how many distinct tag bigrams there are in that sequence.
bigram is mentioned in 11 sentences in this paper.
Topics mentioned in this paper:
Saluja, Avneesh and Hassan, Hany and Toutanova, Kristina and Quirk, Chris
Evaluation
In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq.
Evaluation
Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set.
Evaluation
Using unigrams (“SLP 1-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams .
Generation & Propagation
Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings.
Generation & Propagation
We only consider target phrases whose source phrase is a bigram , but it is worth noting that the target phrases are of variable length.
Generation & Propagation
To generate new translation candidates using the baseline system, we decode each unlabeled source bigram to generate its m-best translations.
bigram is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Thadani, Kapil
Multi-Structure Sentence Compression
Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies.
Multi-Structure Sentence Compression
C. In addition, we define bigram indicator variables y_{ij} ∈ {0, 1} to represent whether a particular order-preserving bigram ⟨t_i, t_j⟩ from S is present as a contiguous bigram in C, as well as dependency indicator variables z_{ij} ∈ {0, 1} corresponding to whether the dependency arc t_i → t_j is present in the dependency parse of C. The score for a given compression C can now be defined to factor over its tokens, n-grams and dependencies as follows.
Multi-Structure Sentence Compression
where θ_tok, θ_ngr and θ_dep are feature-based scoring functions for tokens, bigrams and dependencies respectively.
bigram is mentioned in 13 sentences in this paper.
Topics mentioned in this paper:
Cao, Guihong and Robertson, Stephen and Nie, Jian-Yun
Abstract
The selection is made according to the appropriateness of the alteration to the query context (using a bigram language model), or according to its expected impact on the retrieval effectiveness (using a regression model).
Bigram Expansion Model for Alteration Selection
The query context is modeled by a bigram language model as in (Peng et al.
Bigram Expansion Model for Alteration Selection
In this work, we used a bigram language model to calculate the probability of each path.
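As a hedged illustration of scoring alteration paths with a bigram language model, the sketch below uses a toy probability table; a real system would estimate and smooth these probabilities from large data, and the vocabulary here is an assumption:

    import math

    # Toy bigram probabilities P(w2 | w1); real probabilities would be
    # estimated from query logs or a large corpus and smoothed (assumption).
    BIGRAM_P = {
        ("<s>", "cheap"): 0.2, ("cheap", "flights"): 0.3, ("flights", "</s>"): 0.4,
        ("<s>", "cheep"): 0.001, ("cheep", "flights"): 0.001,
    }

    def path_log_prob(words, floor=1e-6):
        """Log-probability of one alteration path under the bigram model."""
        padded = ["<s>"] + words + ["</s>"]
        return sum(math.log(BIGRAM_P.get(bg, floor))
                   for bg in zip(padded, padded[1:]))

    # The path whose alterations cohere best with the query context wins.
    for candidate in (["cheap", "flights"], ["cheep", "flights"]):
        print(candidate, path_log_prob(candidate))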
Introduction
The query context is modeled by a bigram language model.
Introduction
which is the most coherent with the bigram model.
Introduction
We call this model Bigram Expansion.
Related Work
2007), a bigram language model is used to determine the alteration of the head word that best fits the query.
Related Work
In this paper, one of the proposed methods will also use a bigram language model of the query to determine the appropriate alteration candidates.
bigram is mentioned in 25 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hao and Gildea, Daniel
Abstract
We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model.
Abstract
The trigram pass closes most of the performance gap between a bigram decoder and a much slower trigram decoder, but takes time that is insignificant in comparison to the bigram pass.
Decoding to Maximize BLEU
The outside-pass Algorithm 1 for bigram decoding can be generalized to the trigram case.
Experiments
Hyperedges BLEU Bigram Pass 167K 21.77 Trigram Pass UNI — —BO + 629.7K=796.7K 23.56 BO+BB +2.7K =169.
Introduction
First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model, and the second pass expands these states into trigram-based
Introduction
With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder.
Multi-pass LM-Integrated Decoding
More specifically, a bigram decoding pass is executed forward and backward to figure out the probability of each state.
Multi-pass LM-Integrated Decoding
We take the same view as in speech recognition that a trigram integrated model is a finer-grained model than a bigram model, and in general we can do an (n − 1)-gram decoding as a predictive pass for the following n-gram pass.
Multi-pass LM-Integrated Decoding
If we can afford a bigram decoding pass, the outside cost from a bigram model is conceivably a
bigram is mentioned in 15 sentences in this paper.
Topics mentioned in this paper:
Dickinson, Markus
Evaluation
The results are shown in figure 1 for the whole daughters scoring method and in figure 2 for the bigram method.
Evaluation
Figure 2: Bigram ungeneralizability (dev.)
Rule dissimilarity and generalizability
To do this, we can examine the weakest parts of each rule and compare those across the corpus, to see which anomalous patterns emerge; we do this in the Bigram scoring section below.
Rule dissimilarity and generalizability
Bigram scoring The other method of detecting ad hoc rules calculates reliability scores by focusing specifically on what the classes do not have in common.
Rule dissimilarity and generalizability
We abstract to bigrams , including added START and END tags, as longer sequences risk missing generalizations; e. g., unary rules would have no comparable rules.
bigram is mentioned in 20 sentences in this paper.
Topics mentioned in this paper:
Uszkoreit, Jakob and Brants, Thorsten
Distributed Clustering
With each word w in one of these sets, all words v preceding w in the corpus are stored with the respective bigram count N(v, w).
Distributed Clustering
While the greedy non-distributed exchange algorithm is guaranteed to converge as each exchange increases the log likelihood of the assumed bigram model, this is not necessarily true for the distributed exchange algorithm.
Exchange Clustering
Beginning with an initial clustering, the algorithm greedily maximizes the log likelihood of a two-sided class bigram or trigram model as described in Eq.
Exchange Clustering
With N^c_pre and N^c_suc denoting the average number of clusters preceding and succeeding another cluster, B denoting the number of distinct bigrams in the training corpus, and I denoting the number of iterations, the worst case complexity of the algorithm is in:
Exchange Clustering
When using large corpora with large numbers of bigrams, the number of required updates can increase towards the quadratic upper bound as N^c_pre and N^c_suc approach N_C.
Predictive Exchange Clustering
Modifying the exchange algorithm in order to optimize the log likelihood of a predictive class bigram model, leads to substantial performance improvements, similar to those previously reported for another type of one-sided class model in (Whittaker and Woodland, 2001).
Predictive Exchange Clustering
We use a predictive class bigram model as given in Eq.
Predictive Exchange Clustering
Then the following optimization criterion can be derived, with F(C) being the log likelihood function of the predictive class bigram model given a clustering C:
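One common parameterization of a predictive class bigram model is p(w_i | w_{i-1}) = p(c(w_i) | w_{i-1}) · p(w_i | c(w_i)); the sketch below computes the corresponding log likelihood F(C) for a toy clustering with unsmoothed maximum likelihood estimates. The exact parameterization and estimation details are assumptions, not the paper's implementation:

    import math
    from collections import Counter

    def class_bigram_log_likelihood(tokens, cluster_of):
        """Log likelihood of a corpus under a predictive class bigram model
        p(w_i | w_{i-1}) = p(c(w_i) | w_{i-1}) * p(w_i | c(w_i)),
        with maximum likelihood estimates (a simplified sketch)."""
        word_counts = Counter(tokens)
        class_counts = Counter(cluster_of[w] for w in tokens)
        # Counts of (previous word, class of current word) and of history words.
        trans_counts = Counter((w1, cluster_of[w2]) for w1, w2 in zip(tokens, tokens[1:]))
        hist_counts = Counter(tokens[:-1])

        ll = 0.0
        for w1, w2 in zip(tokens, tokens[1:]):
            c2 = cluster_of[w2]
            p_class = trans_counts[(w1, c2)] / hist_counts[w1]
            p_word = word_counts[w2] / class_counts[c2]
            ll += math.log(p_class) + math.log(p_word)
        return ll

    tokens = "a cat saw a dog a cat ran".split()
    clusters = {"a": 0, "cat": 1, "dog": 1, "saw": 2, "ran": 2}  # toy clustering
    print(class_bigram_log_likelihood(tokens, clusters))

An exchange algorithm would repeatedly move single words between clusters, keeping a move whenever it increases this likelihood.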
bigram is mentioned in 12 sentences in this paper.
Topics mentioned in this paper:
Chan, Yee Seng and Ng, Hwee Tou
Metric Design Considerations
Similarly, we also match the bigrams and trigrams of the sentence pair and calculate their corresponding Fmean scores.
Metric Design Considerations
where in our experiments, we set N =3, representing calculation of unigram, bigram , and trigram scores.
Metric Design Considerations
To determine the number match_bi of bigram matches, a system bigram (l_{s_i} p_{s_i}, l_{s_{i+1}} p_{s_{i+1}}) matches a reference bigram (l_{r_i} p_{r_i}, l_{r_{i+1}} p_{r_{i+1}}
bigram is mentioned in 16 sentences in this paper.
Topics mentioned in this paper:
Mochihashi, Daichi and Yamada, Takeshi and Ueda, Naonori
Experiments
Bigram and trigram performances are similar for Chinese, but trigram performs better for Japanese.
Experiments
In fact, although the difference in perplexity per character is not so large, the perplexity per word is radically reduced: 439.8 ( bigram ) to 190.1 (trigram).
Inference
Furthermore, it has an inherent limitation that it cannot deal with larger than bigrams , because it uses only local statistics between directly contiguous words for word segmentation.
Inference
It has an additional advantage in that we can accommodate higher-order relationships than bigrams , particularly trigrams, for word segmentation.
Inference
For this purpose, we maintain a forward variable α[t] in the bigram case.
Introduction
Crucially, since they rely on sampling a word boundary between two neighboring words, they can leverage only up to bigram word dependencies.
Pitman-Yor process and n-gram models
The bigram distribution G2 = {
bigram is mentioned in 10 sentences in this paper.
Topics mentioned in this paper:
Hall, David and Klein, Dan
Experiments
We follow prior work and use sets of bigrams within words.
Experiments
In our case, during bipartite matching the set X is the set of bigrams in the language being re-permuted, and Y is the union of bigrams in the other languages.
Experiments
Besides the heuristic baseline, we tried our model-based approach using Unigrams, Bigrams and Anchored Unigrams, with and without learning the parametric edit distances.
Message Approximation
Figure 2: Various topologies for approximating messages: (a) a unigram model, (b) a bigram model, (c) the anchored unigram model, and (d) the n-best plus backoff model used in Dreyer and Eisner (2009).
Message Approximation
The first is a plain unigram model, the second is a bigram model, and the third is an anchored unigram topology: a position-specific unigram model for each position up to some maximum length.
Message Approximation
The second topology we consider is the bigram topology, illustrated in Figure 2(b).
bigram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Rush, Alexander M. and Collins, Michael
A Simple Lagrangian Relaxation Algorithm
We now give a Lagrangian relaxation algorithm for integration of a hypergraph with a bigram language model, in cases where the hypergraph satisfies the following simplifying assumption:
A Simple Lagrangian Relaxation Algorithm
over the original (non-intersected) hypergraph, with leaf nodes having weights θ_v + u(v). (3) If the output derivation from step 2 has the same set of bigrams as those from step 1, then we have an exact solution to the problem.
A Simple Lagrangian Relaxation Algorithm
C1 states that each leaf in a derivation has exactly one incoming bigram, and that each leaf not in the derivation has 0 incoming bigrams; C2 states that each leaf in a derivation has exactly one outgoing bigram , and that each leaf not in the derivation has 0 outgoing bigrams.6
Background: Hypergraphs
Throughout this paper we make the following assumption when using a bigram language model:
Background: Hypergraphs
Assumption 3.1 ( Bigram start/end assumption.)
The Full Algorithm
The set P of trigram paths plays an analogous role to the set B of bigrams in our previous algorithm.
bigram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Kruengkrai, Canasai and Uchimoto, Kiyotaka and Kazama, Jun'ichi and Wang, Yiou and Torisawa, Kentaro and Isahara, Hitoshi
Training method
We broadly classify features into two categories: unigram and bigram features.
Training method
TB0  ⟨T_E(w−1)⟩
TB1  ⟨T_E(w−1), p0⟩
TB2  ⟨T_E(w−1), p−1, p0⟩
TB3  ⟨T_E(w−1), T_B(w0)⟩
TB4  ⟨T_E(w−1), T_B(w0), p0⟩
TB5  ⟨T_E(w−1), p−1, T_B(w0)⟩
TB6  ⟨T_E(w−1), p−1, T_B(w0), p0⟩
CB0  ⟨p−1, p0⟩ otherwise
Table 3: Bigram features.
Training method
Bigram features: Table 3 shows our bigram features.
bigram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Hirao, Tsutomu and Suzuki, Jun and Isozaki, Hideki
A Syntax Free Sequence-oriented Sentence Compression Method
1 if l(y_j) = l(y_{j−1}) + 1; λ_PLM · Bigram(w_{l(y_j)}, w_{l(y_{j−1})}) otherwise   (5)
A Syntax Free Sequence-oriented Sentence Compression Method
Here, 0 ≤ λ_PLM ≤ 1, and Bigram(·) indicates word bigram probability.
A Syntax Free Sequence-oriented Sentence Compression Method
The first line of equation (5) agrees with Jing’s observation on sentence alignment tasks (Jing and McKeown, 1999); that is, most (or almost all) bigrams in a compressed sentence appear in the original sentence as they are.
Experimental Evaluation
For example, label ‘w/o IPTW + Dep’ employs IDF term weighting as function and word bigram, part-of-speech bigram and dependency probability between words as function in equation (1).
Results and Discussion
Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly.
Results and Discussion
Most bigrams in a compressed sentence followed those in the source sentence.
Results and Discussion
PLM is similar to dependency probability in that both features emphasize word pairs that occurred as bigrams in the source sentence.
bigram is mentioned in 9 sentences in this paper.
Topics mentioned in this paper:
Bicknell, Klinton and Levy, Roger
Simulation 1
Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con-
Simulation 1
From this vocabulary, we constructed a bigram model using the counts from every bigram in the BNC for which both words were in vocabulary (about 222,000 bigrams ).
Simulation 1
Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model) by directly translating the bigram model into a wFSA in the log semiring.
Simulation 2
Instead, we begin with the same set of bigrams used in Sim.
Simulation 2
1 — i.e., those that contain two in-vocabulary words — and trim this set by removing rare bigrams that occur less than 200 times in the BNC (except that we do not trim any bigrams that occur in our test corpus).
Simulation 2
This reduces our set of bigrams to about 19,000.
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Sun, Xu and Wang, Houfeng and Li, Wenjie
System Architecture
To derive word features, first of all, our system automatically collects a list of word unigrams and bigrams from the training data.
System Architecture
To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set.
System Architecture
This list of word unigrams and bigrams is then used as a unigram-dictionary and a bigram-dictionary to generate word-based unigram and bigram features.
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Rasooli, Mohammad Sadegh and Lippincott, Thomas and Habash, Nizar and Rambow, Owen
Evaluation
To our surprise, the Fixed Affix model does a slightly better job in reducing out of vocabulary than the Bigram Affix model.
Evaluation
WoTr 24.21 Bigram Affix Model TRR 25.
Morphology-based Vocabulary Expansion
We use two different models of morphology expansion in this paper: Fixed Affix model and Bigram Affix model.
Morphology-based Vocabulary Expansion
3.2.2 Bigram Affix Expansion Model
Morphology-based Vocabulary Expansion
In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine.
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zaslavskiy, Mikhail and Dymetman, Marc and Cancedda, Nicola
Experiments
Bigram based reordering.
Experiments
First we consider a bigram Language Model and the algorithms try to find the reordering that maximizes the LM score.
Experiments
This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM of the reference.
Phrase-based Decoding as TSP
• The language model cost of producing the target words of b′ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b′.
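A hedged sketch of that precomputation: under a bigram LM, the cost of producing phrase b′ right after phrase b depends only on the two phrases, so it can be tabulated for every phrase pair in advance (the toy LM and phrase representation are assumptions):

    import math

    def bigram_logprob(w1, w2, lm, floor=1e-6):
        return math.log(lm.get((w1, w2), floor))

    def transition_lm_cost(b_tgt, b_next_tgt, lm):
        """Negative log LM score of producing phrase b' right after phrase b
        under a bigram LM: the boundary bigram plus b''s internal bigrams.
        It depends only on the two phrases, so it can be precomputed for
        every phrase pair, which is what makes the TSP view possible."""
        words = [b_tgt[-1]] + list(b_next_tgt)
        return -sum(bigram_logprob(w1, w2, lm) for w1, w2 in zip(words, words[1:]))

    toy_lm = {("cat", "sat"): 0.3, ("sat", "down"): 0.4}   # toy probabilities
    print(transition_lm_cost(["the", "cat"], ["sat", "down"], toy_lm))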
Phrase-based Decoding as TSP
This restriction to bigram models will be removed in Section 4.1.
Phrase-based Decoding as TSP
4.1 From Bigram to N-gram LM
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Gormley, Matthew R. and Mitchell, Margaret and Van Durme, Benjamin and Dredze, Mark
Approaches
The clusters are formed by a greedy hierarchical clustering algorithm that finds an assignment of words to classes by maximizing the likelihood of the training data under a latent-class bigram model.
Approaches
First, for SRL, it has been observed that feature bigrams (the concatenation of simple features such as a predicate’s POS tag and an argument’s word) are important for state-of-the-art (Zhao et al., 2009; Bjorkelund et al., 2009).
Approaches
We consider both template unigrams and bigrams , combining two templates in sequence.
Experiments
Each of 1G0 and 1GB also include 32 template bigrams selected by information gain on 1000 sentences—we select a different set of template bigrams for each dataset.
Experiments
However, the original unigram Bjorkelund features (Bdeflmemh), which were tuned for a high-resource model, obtain higher F1 than our information gain set using the same features in unigram and bigram templates (1GB).
Experiments
In Czech, we disallowed template bigrams involving path-grams.
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Andreevskaia, Alina and Bergler, Sabine
Experiments
Consistent with findings in the literature (Cui et al., 2006; Dave et al., 2003; Gamon and Aue, 2005), on the large corpus of movie review texts, the in-domain-trained system based solely on unigrams had lower accuracy than the similar system trained on bigrams .
Experiments
But the trigrams fared slightly worse than bigrams .
Experiments
On sentences, however, we have observed an inverse pattern: unigrams performed better than bigrams and trigrams.
Factors Affecting System Performance
System runs with unigrams, bigrams , and trigrams as features and with different training set sizes are presented.
Integrating the Corpus-based and Dictionary-based Approaches
In the ensemble of classifiers, they used a combination of nine SVM-based classifiers deployed to learn unigrams, bigrams , and trigrams on three different domains, while the fourth domain was used as an evaluation set.
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Zhao, Qiuye and Marcus, Mitch
Abstract
We consider bigram and trigram templates for generating potentially deterministic constraints.
Abstract
A bigram constraint includes one contextual word (w_{-1}|w_1) or the corresponding morph feature; and a trigram constraint includes both contextual words or their morph features.
Abstract
          precision  recall  F1
bigram    0.993      0.841   0.911
trigram   0.996      0.608   0.755
bigram is mentioned in 8 sentences in this paper.
Topics mentioned in this paper:
Bramsen, Philip and Escobar-Molano, Martha and Patel, Ami and Alonso, Rafael
Abstract
To illustrate, consider the following feature set, a bigram and a trigram (each term in the n-gram either has the form word or Atag):
Abstract
please AVB and 9 is the number of bigrams in T, excluding sentence initial and final markers.
Abstract
Unigrams and Bigrams : As a different sort of baseline, we considered the results of a bag-of-words based classifier.
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Kalchbrenner, Nal and Grefenstette, Edward and Blunsom, Phil
Experiments
The baselines NB and BINB are Naive Bayes classifiers with, respectively, unigram features and unigram and bigram features.
Experiments
SVM is a support vector machine with unigram and bigram features.
Experiments
unigram, bigram , trigram 92.6 MAXENT POS, chunks, NE, supertags
Introduction
On the hand-labelled test set, the network achieves a greater than 25% reduction in the prediction error with respect to the strongest unigram and bigram baseline reported in Go et al.
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Ott, Myle and Choi, Yejin and Cardie, Claire and Hancock, Jeffrey T.
Automated Approaches to Deceptive Opinion Spam Detection
Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+ , TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection
We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+ , and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Automated Approaches to Deceptive Opinion Spam Detection
We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+ , and TRIGRAMS+.
Conclusion and Future Work
Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+ ) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC).
Results and Discussion
This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+ ) might be necessary to achieve state-of-the-art deception detection performance.
Results and Discussion
Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e. g., LIWC) by considering both the contextual (e. g., BIGRAMS+ ) and motivational parameters underlying a deception as well.
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Wang, Xiaolin and Utiyama, Masao and Finch, Andrew and Sumita, Eiichiro
Complexity Analysis
For the monolingual bigram model, the number of states in the HMM is U times more than that of the monolingual unigram model, as the states at specific position of F are not only related to the length of the current word, but also related to the length of the word before it.
Complexity Analysis
NPY( bigram )a 0.750 0.802 17 m —NPY(trigram)a 0.757 0.807
Complexity Analysis
HDP( bigram )b 0.723 — 10 h —FitnessC — 0.667 — —Prop.
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Tang, Duyu and Wei, Furu and Yang, Nan and Zhou, Ming and Liu, Ting and Qin, Bing
Related Work
We learn embedding for unigrams, bigrams and trigrams separately with same neural network and same parameter setting.
Related Work
25$ employs the embedding of unigrams, bigrams and trigrams separately and conducts the matrix-vector operation of ac on the sequence represented by columns in each lookup table.
Related Work
L_uni, L_bi and L_tri are the lookup tables of the unigram, bigram and trigram embedding, respectively.
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Tan, Chenhao and Lee, Lillian and Pang, Bo
Introduction
twitter unigram    TTT *   YES (54%)
twitter bigram     TTT *   YES (52%)
personal unigram   MT *    YES (52%)
personal bigram    —       NO (48%)
Introduction
We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log p(w), where T refers to either all the unigrams (unigram model) or all and only bigrams ( bigram model).16 We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
Introduction
16The tokens [at], [hashtag], [url] were ignored in the unigram-model case to prevent their undue influence, but retained in the bigram model to capture longer-range usage (“combination”) patterns.
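A minimal sketch of the scoring formula, (1/|T|) Σ_{w∈T} log p(w), with a toy unigram model estimated by relative frequency; the bigram variant simply replaces tokens with adjacent pairs. The corpus and the smoothing floor are assumptions:

    import math
    from collections import Counter

    def unigram_model(corpus_tokens):
        counts = Counter(corpus_tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def expectedness_score(tweet_tokens, model, floor=1e-8):
        """(1/|T|) * sum of log p(token): higher means the tweet is closer
        to what the language model expects. Unseen tokens get a small floor."""
        if not tweet_tokens:
            return float("-inf")
        return sum(math.log(model.get(w, floor)) for w in tweet_tokens) / len(tweet_tokens)

    community = "good morning twitter good morning world".split()
    model = unigram_model(community)
    print(expectedness_score("good morning everyone".split(), model))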
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Navigli, Roberto and Velardi, Paola
Experiments
• Bigrams: an implementation of the bigram classifier for soft pattern matching proposed by Cui et al.
Experiments
The probability is calculated as a mixture of bigram and
Experiments
WCL-1          99.88  42.09  59.22  76.06
WCL-3          98.81  60.74  75.23  83.48
Star patterns  86.74  66.14  75.05  81.84
Bigrams        66.70  82.70  73.84  75.80
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Lavergne, Thomas and Cappé, Olivier and Yvon, François
Conditional Random Fields
In the sequel, we distinguish between two types of feature functions: unigram features f_{y,x}, associated with parameters μ_{y,x}, and bigram features f_{y′,y,x}, associated with parameters λ_{y′,y,x}.
Conditional Random Fields
On the other hand, bigram features {f_{y′,y,x}}_{(y′,y,x)∈Y²×X} are helpful in modelling dependencies between successive labels.
Conditional Random Fields
Assume the set of bigram features {λ_{y′,y,x_{t+1}}}_{(y′,y)∈Y²} is sparse, with only r(x_{t+1}) ≪ |Y|² non-null values, and define the |Y| × |Y| sparse matrix
bigram is mentioned in 7 sentences in this paper.
Topics mentioned in this paper:
Liu, Jenny and Haghighi, Aria
Experiments
We also kept NPs with only 1 modifier to be used for generating <modifier, head noun> bigram counts at training time.
Experiments
For example, the NP “the beautiful blue Macedonian vase” generates the following bigrams : <beautiful blue>, <blue Macedonian>, and <beautiful Macedonian>, along with the 3-gram <beautiful blue Macedonian>.
Experiments
In addition, we also store a table that keeps track of bigram counts for < M, H >, where H is the head noun of an NP and M is the modifier closest to it.
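A hedged sketch of the two count tables described above: ordered modifier-modifier pairs (every modifier paired with every later one, as in the 'beautiful blue Macedonian vase' example) and <closest modifier, head noun> pairs. The NP representation is an assumption, not the authors' data format:

    from collections import Counter
    from itertools import combinations

    def np_ordering_counts(noun_phrases):
        """Collect ordered <modifier, modifier> pair counts and
        <closest-modifier, head-noun> counts from NPs given as
        (list_of_modifiers, head_noun) pairs."""
        mod_pairs, mod_head = Counter(), Counter()
        for modifiers, head in noun_phrases:
            mod_pairs.update(combinations(modifiers, 2))  # keeps left-to-right order
            if modifiers:
                mod_head[(modifiers[-1], head)] += 1
        return mod_pairs, mod_head

    nps = [(["beautiful", "blue", "Macedonian"], "vase")]
    pairs, heads = np_ordering_counts(nps)
    print(pairs)   # three ordered modifier pairs
    print(heads)   # {('Macedonian', 'vase'): 1}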
Related Work
Shaw and Hatzivassiloglou also use a transitivity method to fill out parts of the Count table where bigrams are not actually seen in the training data but their counts can be inferred from other entries in the table, and they use a clustering method to group together modifiers with similar positional preferences.
Related Work
Shaw and Hatzivassiloglou report a highest accuracy of 94.93% and a lowest accuracy of 65.93%, but since their methods depend heavily on bigram counts in the training corpus, they are also limited in how informed their decisions can be if modifiers in the test data are not present at training time.
bigram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Razmara, Majid and Siahbani, Maryam and Haffari, Reza and Sarkar, Anoop
Experiments & Results 4.1 Experimental Setup
The measures are evaluated by fixing the window size to 4 and maximum candidate paraphrase length to 2 (e. g. bigram ).
Experiments & Results 4.1 Experimental Setup
(Figure legend: unigram, bigram, trigram, quadgram)
Experiments & Results 4.1 Experimental Setup
Type        Node     MRR %  RCL %
Bipartite   unigram  5.2    12.5
            bigram   6.8    15.7
Tripartite  unigram  5.9    12.6
            bigram   6.9    15.9
Baseline    bigram   3.9    7.7
bigram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben
Approach
(a) Word unigrams (b) Word bigrams
Approach
(a) PoS unigrams (b) PoS bigrams (c) PoS trigrams
Approach
Word unigrams and bigrams are lower-cased and used in their inflected forms.
Previous work
The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e. g. pass/fail, grades A-F) based on content and style features such as word unigrams and bigrams , sentence length, number of verbs, noun-verb pairs etc.
Validity tests
(a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence
bigram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Wang, Aobo and Kan, Min-Yen
Methodology
In response to these difficulties in differentiating linguistic registers, we compute two different PMI scores for character-based bigrams from two large corpora representing news and microblogs as features.
Methodology
In addition, we also convert all the character-based bigrams into Pinyin-based bigrams (ignoring tones5) and compute the Pinyin-level PMI in the same way.
Methodology
These features capture inconsistent use of the bigram across the two domains, which assists to distinguish informal words.
bigram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Kaji, Nobuhiro and Fujiwara, Yasuhiro and Yoshinaga, Naoki and Kitsuregawa, Masaru
Introduction
where we explicitly distinguish the unigram feature function φ¹_k and the bigram feature function φ²_k. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w¹_k φ¹_k(x, y_n) and Σ_k w²_k φ²_k(x, y_{n−1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n−1}).
Introduction
For bigram features, we compute its upper bound offline.
Introduction
The simplest case is that the bigram features are independent of the token sequence x.
bigram is mentioned in 6 sentences in this paper.
Topics mentioned in this paper:
Maxwell, K. Tamsin and Oberlander, Jon and Croft, W. Bruce
Evaluation framework
The second clique contains query bigrams that match
Evaluation framework
document bigrams in 2-word ordered windows (‘#1’), λ2 = 0.1.
Evaluation framework
The third clique uses the same bigrams as clique 2 with an 8-word unordered window (‘#uw8’), λ3 = 0.05.
Introduction
Integration of the identified catenae in queries also improves IR effectiveness compared to a highly effective baseline that uses sequential bigrams with no linguistic knowledge.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Chen, Ruey-Cheng
Change in Description Length
Suppose that the original sequence W_{i−1} is N words long, the selected word type pair x and y each occurs k and l times, respectively, and altogether the x·y bigram occurs m times in W_{i−1}.
Change in Description Length
In the new sequence W_i, each of the m bigrams is replaced with an unseen word z = xy.
Regularized Compression
Hence, a new sequence W_i is created in the i-th iteration by merging all the occurrences of some selected bigram (x, y) in the original sequence W_{i−1}.
Regularized Compression
Note that f(x, y) is the bigram frequency, |W_{i−1}| the sequence length of W_{i−1}, and ΔH(W_{i−1}, W_i) = H(W_i) − H(W_{i−1}) is the difference between the empirical Shannon entropy measured on W_i and W_{i−1}, using maximum likelihood estimates.
Regularized Compression
In the new sequence W_i, each occurrence of the x·y bigram is replaced with a new (conceptually unseen) word z.
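A hedged sketch of the merge operation and the entropy term it changes: merge all occurrences of a selected bigram (x, y) into a new symbol z and compare the empirical Shannon entropy before and after. Overlap handling (naive left-to-right) and tokenization are assumptions:

    import math
    from collections import Counter

    def entropy(seq):
        """Empirical Shannon entropy (bits per symbol) under MLE unigram probabilities."""
        counts = Counter(seq)
        n = len(seq)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def merge_bigram(seq, x, y, z):
        """Replace each left-to-right, non-overlapping occurrence of x followed by y
        with the new symbol z."""
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == x and seq[i + 1] == y:
                out.append(z)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    w_prev = "the cat sat on the mat".split()
    w_next = merge_bigram(w_prev, "the", "cat", "the_cat")
    print(entropy(w_next) - entropy(w_prev))  # the delta-H term in the compression objective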
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Li, Zhifei and Eisner, Jason and Khudanpur, Sanjeev
Experimental Results
Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs.
Experimental Results
This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model.
Experimental Results
Indeed, although Table 3 shows that better approximations can be obtained by using higher-order models, the best BLEU score in Tables 2a and 2c was obtained by the bigram model.
Variational Approximate Decoding
However, because q* only approximates p, y* of (13) may be locally appropriate but globally inadequate as a translation of x. Observe, e. g., that an n-gram model q* will tend to favor short strings y, regardless of the length of x. Suppose x = le chat chasse la souris (“the cat chases the mouse”) and q* is a bigram approximation to p(y | x). Presumably q*(the | START), q*(mouse | the), and q*(END | mouse) are all large in q*. So the most probable string y* under q* may be simply “the mouse,” which is short and has a high probability but fails to cover x.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Zhu, Muhua and Zhang, Yue and Chen, Wenliang and Zhang, Min and Zhu, Jingbo
Baseline parser
bigrams s0ws1w, s0ws1c, s0cs1w, s0cs1c, s0wq0w, s0wq0t, s0cq0w, s0cq0t, q0wq1w, q0wq1t, q0tq1w, q0tq1t, s1wq0w, s1wq0t, s1cq0w, s1cq0t
Semi-supervised Parsing with Large Data
2 From the dependency trees, we extract bigram lexical dependencies (w1, w2, L/R) where the symbol L (R) means that w1 (w2) is the head of w2 (w1).
Semi-supervised Parsing with Large Data
(2009), we assign categories to bigram and trigram items separately according to their frequency counts.
Semi-supervised Parsing with Large Data
Hereafter, we refer to the bigram and trigram lexical dependency lists as BLD and TLD, respectively.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Song, Young-In and Lee, Jung-Tae and Rim, Hae-Chang
Experiments
Phrase extraction and indexing: We evaluate our proposed method on two different types of phrases: syntactic head-modifier pairs (syntactic phrases) and simple bigram phrases (statistical phrases).
Experiments
Since our method is not limited to a particular type of phrases, we have also conducted experiments on statistical phrases ( bigrams ) with a reduced set of features directly applicable; RMO, RSO, PD5, DF, and CPP; the features requiring linguistic preprocessing (e. g. PPT) are not used, because it is unrealistic to use them under a bigram-based retrieval setting.
Experiments
5In most cases, the distance between words in a bigram is 1, but sometimes, it could be more than 1 because of the effect of stopword removal.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Celikyilmaz, Asli and Hakkani-Tur, Dilek
Experiments and Discussions
We use R-l (recall against unigrams), R-2 (recall against bigrams), and R-SU4 (recall against skip-4 bigrams ).
Experiments and Discussions
Note that R-2 is a measure of bigram recall and sumHLDA of HybHSum2 is built on unigrams rather than bigrams.
Regression Model
We similarly include bigram features in the experiments.
Regression Model
We also include bigram extensions of DMF features.
Regression Model
We use sentence bigram frequency, sentence rank in a document, and sentence size as additional fea-
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Uszkoreit, Hans
Capturing Paradigmatic Relations via Word Clustering
The quality is defined based on a class-based bigram language model as follows.
Capturing Paradigmatic Relations via Word Clustering
The objective function is maximizing the likelihood ∏_i P(w_i|w_1, ..., w_{i−1}) of the training data given a partially class-based bigram model of the form
State-of-the-Art
Word bigrams : w−2 w−1, w−1 w, w w+1, w+1 w+2; In order to better handle unknown words, we extract morphological features: character n-gram prefixes and suffixes for n up to 3.
State-of-the-Art
(2009) introduced a bigram HMM model with latent variables (Bigram HMM-LA in the table) for Chinese tagging.
State-of-the-Art
Trigram HMM (Huang et al., 2009)      93.99%
Bigram HMM-LA (Huang et al., 2009)    94.53%
Our tagger                            94.69%
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Szarvas, Gy"orgy
Introduction
• The extension of the feature representation used by previous works with bigrams and trigrams and an evaluation of the benefit of using longer keywords in hedge classification.
Methods
For trigrams, bigrams and unigrams — processed separately — we calculated a new class-conditional probability for each feature x, discarding those observations of x in speculative instances where x was not among the two highest ranked candidates.
Results
These keywords were many times used in a mathematical context (referring to probabilities) and thus expressed no speculative meaning, while such uses were not represented in the FlyBase articles (otherwise bigram or trigram features could have captured these non-speculative uses).
Results
One third of the features used by our advanced model were either bigrams or trigrams.
Results
Our model using just unigram features achieved a BEP(spec) score of 78.68% and F_{β=1}(spec) score of 80.23%, which means that using bigram and trigram hedge cues here significantly improved the performance (the difference in BEP(spec) and F_{β=1}(spec) scores were 5.23% and 4.97%, respectively).
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Green, Spence and DeNero, John
A Class-based Model of Agreement
The features are indicators for (character, position, label) triples for a five-character window and bigram label transition indicators.
A Class-based Model of Agreement
Bigram transition features gbt encode local agreement relations.
A Class-based Model of Agreement
We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data:
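A minimal sketch of an add-1 smoothed bigram model over class sequences; the toy class inventory and sequences are assumptions, not the treebank classes used in the paper:

    from collections import Counter

    def train_add1_bigram(sequences, classes):
        """Add-1 smoothed bigram model:
        P(c2 | c1) = (count(c1, c2) + 1) / (count(c1) + V),
        where V is the number of possible next symbols."""
        pair_counts, unigram_counts = Counter(), Counter()
        for seq in sequences:
            padded = ["<s>"] + seq + ["</s>"]
            pair_counts.update(zip(padded, padded[1:]))
            unigram_counts.update(padded[:-1])
        V = len(classes) + 1  # +1 for the end-of-sequence symbol
        def prob(c1, c2):
            return (pair_counts[(c1, c2)] + 1) / (unigram_counts[c1] + V)
        return prob

    # Toy agreement-class sequences (assumption).
    seqs = [["MS", "MS", "FS"], ["FS", "FS"], ["MS", "FS", "FS"]]
    prob = train_add1_bigram(seqs, classes={"MS", "FS"})
    print(prob("MS", "FS"), prob("FS", "MS"))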
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Constant, Matthieu and Sigogne, Anthony and Watrin, Patrick
MWE-dedicated Features
We use word unigrams and bigrams in order to capture multiwords present in the training section and to extract lexical cues to discover new MWEs.
MWE-dedicated Features
For instance, the bigram coup de is often the prefix of compounds such as coup de pied (kick), coup de foudre (love at first sight), coup de main (help).
MWE-dedicated Features
We use part-of-speech unigrams and bigrams in order to capture MWEs with irregular syntactic structures that might indicate the id-iomacity of a word sequence.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Fleischman, Michael and Roy, Deb
Evaluation
The remaining 93 unlabeled games are used to train unigram, bigram , and trigram grounded language models.
Evaluation
Only unigrams, bigrams , and trigrams that are not proper names, appear greater than three times, and are not composed only of stop words were used.
Evaluation
with traditional unigram, bigram , and trigram language models generated from a combination of the closed captioning transcripts of all training games and data from the switchboard corpus (see below).
Linguistic Mapping
Estimating bigram and trigram models can be done by processing on word pairs or triples, and performing normalization on the resulting conditional distributions.
bigram is mentioned in 5 sentences in this paper.
Topics mentioned in this paper:
Ramanath, Rohan and Liu, Fei and Sadeh, Norman and Smith, Noah A.
Approach
In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue.
Approach
0,; is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents.
Approach
models (e. g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram ).
Experiment
Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010).7 To more closely match our models, LDA is given access to the same unigram and bigram tokens.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Kobayashi, Hayato
Experiments
In the case of bigrams , the perpleXities of TheoryZ are almost the same as that of Zipf2 when the size of reduced vocabulary is large.
Perplexity on Reduced Corpora
This model seems to be stupid, since we can easily notice that the bigram “is is” is quite frequent, and the two bigrams “is a” and “a is” have the same frequency.
Perplexity on Reduced Corpora
For example, the decay function g_2 of bigrams is as follows:
Perplexity on Reduced Corpora
They pointed out that the exponent of bigrams is about 0.66, and that of 5-grams is about 0.59 in the Wall Street Journal corpus (WSJ 87).
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Prabhakaran, Vinodkumar and Rambow, Owen
Predicting Direction of Power
Baseline (Always Superior): 52.54
Baseline (Word Unigrams + Bigrams): 68.56
THRNew: 55.90
THRPR: 54.30
DIAPR: 54.05
THRPR + THRNew: 61.49
DIAPR + THRPR + THRNew: 62.47
LEX: 70.74
LEX + DIAPR + THRPR: 67.44
LEX + DIAPR + THRPR + THRNew: 68.56
BEST (= LEX + THRNew): 73.03
BEST (Using p1 features only): 72.08
BEST (Using IMt features only): 72.11
BEST (Using Mt only): 71.27
BEST (No Indicator Variables): 72.44
Predicting Direction of Power
We found the best setting to be using both unigrams and bigrams for all three types of n-grams, by tuning on our dev set.
Predicting Direction of Power
We also use a stronger baseline using word unigrams and bigrams as features, which obtained an accuracy of 68.6%.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Elsner, Micha and Goldwater, Sharon and Eisenstein, Jacob
Conclusion
We have presented a noisy-channel model that simultaneously learns a lexicon, a bigram language model, and a model of phonetic variation, while using only the noisy surface forms as training data.
Introduction
Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model).
Introduction
Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability.
Related work
In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Wang, Zhiguo and Xue, Nianwen
Joint POS Tagging and Parsing with Nonlocal Features
Feature templates from (Collins and Koo, 2005) and (Charniak and Johnson, 2005): Rules, CoPar, HeadTree, Bigrams, CoLenPar
Joint POS Tagging and Parsing with Nonlocal Features
Grandparent Bigrams, Heavy
Joint POS Tagging and Parsing with Nonlocal Features
Lexical Bigrams, Neighbours
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhang, Hui and Chiang, David
Count distributions
For example, suppose that our data consists of the following bigrams , with their weights:
Word Alignment
That is, during the E step, we calculate the distribution of C(e, f) for each e and f, and during the M step, we train a language model on bigrams e f using expected KN smoothing (that is, with u = e and w = f).
Word Alignment
(The latter case is equivalent to a backoff language model, where, since all bigrams are known, the lower-order model is never used.)
Word Alignment
This is much less of a problem in KN smoothing, where p’ is estimated from bigram types rather than bigram tokens.
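The type-versus-token distinction behind the Kneser-Ney lower-order distribution can be illustrated with a small sketch; this shows generic KN intuition, not the paper's expected-count variant.

```python
from collections import Counter, defaultdict

def continuation_unigram(bigram_counts):
    """Kneser-Ney style lower-order distribution: p'(w) is proportional to
    the number of distinct bigram *types* ending in w, not token counts."""
    left_contexts = defaultdict(set)
    for (u, w), c in bigram_counts.items():
        if c > 0:
            left_contexts[w].add(u)
    total_types = sum(len(s) for s in left_contexts.values())
    return {w: len(s) / total_types for w, s in left_contexts.items()}

# Toy counts: "san francisco" is frequent, but "francisco" follows only one word.
counts = Counter({("san", "francisco"): 50, ("the", "dog"): 3, ("a", "dog"): 2})
print(continuation_unigram(counts))  # 'dog' outweighs 'francisco' by type count
```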
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Hermjakob, Ulf and Knight, Kevin and Daumé III, Hal
Learning what to transliterate
From the stat section we collect statistics as to how often every word, bigram or trigram occurs, and what distribution of name/non-name patterns these ngrams have.
Learning what to transliterate
The name distribution bigram
Learning what to transliterate
stat corpus bitext, the first word is marked up as a non-name (”0”) and the second as a name (”1”), which strongly suggests that in such a bigram context, aljzyre better be translated as island or peninsula, and not be transliterated as Al-Jazeera.
Transliterator
The same consonant skeleton indexing process is applied to name bigrams (47,700,548 unique with 167,398,054 skeletons) and trigrams (46,543,712 unique with 165,536,451 skeletons).
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Kim, Young-Bum and Snyder, Benjamin
Analysis
with a bigram HMM with four language clusters.
Inference
where n(t) and n(t, t′) are, respectively, unigram and bigram tag counts excluding those containing character w. Conversely, n′(t) and n′(t, t′) are, respectively, unigram and bigram tag counts only including those containing character w. The notation a^(n) denotes the ascending factorial: a(a + 1) ··· (a + n − 1).
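The ascending factorial is a standard building block of collapsed Dirichlet-multinomial samplers like the one described here; the sketch below shows a generic ratio for adding m observations of one tag, not the paper's exact posterior.

```python
def ascending_factorial(a, n):
    """a^(n) = a (a + 1) ... (a + n - 1), with a^(0) = 1."""
    result = 1.0
    for i in range(n):
        result *= a + i
    return result

def add_m_ratio(n_t, N, m, alpha, T):
    """Dirichlet-multinomial marginal ratio for adding m new observations of a
    tag that currently has count n_t out of N total, with concentration alpha
    over T tag types (a generic collapsed-sampler building block)."""
    return ascending_factorial(n_t + alpha, m) / ascending_factorial(N + T * alpha, m)

print(add_m_ratio(n_t=5, N=100, m=2, alpha=0.1, T=10))
```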
Inference
where n(j, k, t) and n(j, k, t, t′) are the numbers of languages currently assigned to cluster k which have more than j occurrences of unigram (t) and bigram (t, t′), respectively.
Model
We note that in practice, we implemented a trigram version of the model, but we present the bigram version here for notational clarity.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Kim, Joohyun and Mooney, Raymond
Reranking Features
Bigram .
Reranking Features
Indicates whether a given bigram of nonterminals/terminals occurs for a given parent nonterminal: f(L1 → L2 : L3) = 1.
Reranking Features
Grandparent Bigram .
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Surdeanu, Mihai and Ciaramita, Massimiliano and Zaragoza, Hugo
Approach
Bigrams (B) - the text is represented as a bag of bigrams (larger n-grams did not help).
Approach
Generalized bigrams (Bg) - same as above, but the words are generalized to their WNSS.
Experiments
The first chosen feature is the translation probability computed between the Bg question and answer representations (bigrams with words generalized to their WNSS tags).
Experiments
This is caused by the fact that the BM25 formula is less forgiving with errors of the NLP processors (due to the high idf scores assigned to bigrams and dependencies), and the WNSS tagger is the least robust component in our pipeline.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
DeNero, John and Chiang, David and Knight, Kevin
Computing Feature Expectations
Hyper-edges (boxes) are annotated with normalized transition probabilities, as well as the bigrams produced by each rule application.
Computing Feature Expectations
The expected count of the bigram “man with” is the sum of posterior probabilities of the two hyper-edges that produce it.
Computing Feature Expectations
For example, decoding under a variational approximation to the model’s posterior that decomposes over bigram probabilities is equivalent to fast consensus decoding with
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Tang, Hao and Keshet, Joseph and Livescu, Karen
Discussion
In the figure, phone bigram TF-IDF is labeled p2; phonetic alignment with dynamic programming is labeled DP.
Experiments
The TF-IDF features used in the experiments are based on phone bigrams .
Feature functions
In practice, we only consider n-grams of a certain order (e.g., bigrams).
Feature functions
Then for the bigram /l iy/, we have TF_/l iy/ = 1/5 (one out of five bigrams in the pronunciation), and IDF_/l iy/ = log(2/1) (one word out of two in the dictionary).
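A hedged sketch of TF-IDF over phone bigrams in the spirit of this worked example; the two-word pronunciation dictionary below is made up, not taken from the paper.

```python
import math
from collections import Counter

def phone_bigrams(phones):
    return list(zip(phones, phones[1:]))

def tf_idf_features(pron, dictionary_prons):
    """TF of each phone bigram in one pronunciation, weighted by IDF
    computed over a (toy) pronunciation dictionary."""
    bigrams = phone_bigrams(pron)
    tf = Counter(bigrams)
    n_words = len(dictionary_prons)
    feats = {}
    for bg, c in tf.items():
        df = sum(bg in set(phone_bigrams(p)) for p in dictionary_prons)
        feats[bg] = (c / len(bigrams)) * math.log(n_words / df)
    return feats

# Toy two-word dictionary; pronunciations are illustrative, not from the paper.
dictionary = [["b", "ax", "l", "iy", "v", "z"], ["r", "iy", "l", "ay", "z"]]
print(tf_idf_features(dictionary[0], dictionary))
```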
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Turian, Joseph and Ratinov, Lev-Arie and Bengio, Yoshua
Clustering-based word representations
The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992).
Clustering-based word representations
So it is a class-based bigram language model.
Clustering-based word representations
One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context.
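The class-based bigram factorization underlying Brown clustering, p(wi | wi−1) = p(ci | ci−1) p(wi | ci), can be sketched as follows; the hard clustering used here is a toy assignment, not output of the Brown algorithm itself.

```python
from collections import Counter

def class_bigram_prob(tokens, word2class):
    """Estimate a class-based bigram model:
    p(w_i | w_{i-1}) = p(c_i | c_{i-1}) * p(w_i | c_i)."""
    classes = [word2class[w] for w in tokens]
    class_bi = Counter(zip(classes, classes[1:]))   # class transition counts
    class_uni = Counter(classes[:-1])               # class history counts
    emit = Counter(zip(classes, tokens))            # class -> word counts
    class_tot = Counter(classes)

    def prob(prev_w, w):
        cp, c = word2class[prev_w], word2class[w]
        trans = class_bi[(cp, c)] / class_uni[cp]
        emission = emit[(c, w)] / class_tot[c]
        return trans * emission

    return prob

# Toy hard clustering (illustrative, not produced by the Brown algorithm).
clusters = {"the": "C_DET", "a": "C_DET", "dog": "C_NOUN",
            "cat": "C_NOUN", "runs": "C_VERB"}
p = class_bigram_prob("the dog runs a cat runs".split(), clusters)
print(p("the", "dog"))
```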
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Zhou, Guangyou and Zhao, Jun and Liu, Kang and Cai, Li
Experiments
Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google. Inflected queries are performed by expanding a bigram or trigram into all its morphological forms.
Experiments
Although Google hits are noisier, they have much larger coverage of bigrams and trigrams.
Experiments
This means that if the number of pages indexed by Google doubles, then so do the bigram and trigram frequencies.
Related Work
Keller and Lapata (2003) evaluated the utility of using web search engine statistics for unseen bigrams.
bigram is mentioned in 4 sentences in this paper.
Topics mentioned in this paper:
Jia, Zhongye and Zhao, Hai
Experiments
All of the three smoothing methods for bigram and trigram LMs are examined both using back-off models
Pinyin Input Method Model
The edge weight is the negative logarithm of the conditional probability P(Sj+1,k | Sj,i) that a syllable Sj,i is followed by Sj+1,k, which is given by a bigram language model of pinyin syllables:
Pinyin Input Method Model
WE(Vj,i → Vj+1,k) = −log P(Vj+1,k | Vj,i). Although the model is formulated as a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher-order n-gram LM, by tracking a longer history while traversing the graph.
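A minimal sketch of scoring candidate syllable segmentations with negative-log bigram edge weights; the probabilities below are placeholders, and enumerating candidate paths (instead of a proper lattice search) is a simplification for illustration.

```python
import math

# Placeholder bigram probabilities over pinyin syllables (not real LM estimates).
BIGRAM_P = {("<s>", "ni"): 0.2, ("ni", "hao"): 0.5, ("ni", "ha"): 0.1, ("ha", "o"): 0.3}

def edge_weight(prev_syll, syll, floor=1e-8):
    """Edge weight = negative log of the bigram transition probability."""
    return -math.log(BIGRAM_P.get((prev_syll, syll), floor))

def best_path(paths):
    """Pick the syllable segmentation whose summed edge weights are smallest."""
    def cost(path):
        return sum(edge_weight(p, s) for p, s in zip(["<s>"] + path, path))
    return min(paths, key=cost)

# Two candidate segmentations of the pinyin string "nihao".
print(best_path([["ni", "hao"], ["ni", "ha", "o"]]))  # expect ['ni', 'hao']
```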
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Bartlett, Susan and Kondrak, Grzegorz and Cherry, Colin
Syllabification with Structured SVMs
For example, the bigram bl frequently occurs within a single English syllable, while the bigram lb generally straddles two syllables.
Syllabification with Structured SVMs
Thus, in addition to the single-letter features outlined above, we also include in our representation any bigrams , trigrams, four-grams, and five-grams that fit inside our context window.
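A small sketch of collecting all n-grams (up to five-grams) that fit inside a fixed context window around a position, roughly in the spirit of the feature set described; the window size and feature string format here are arbitrary choices.

```python
def window_ngrams(word, i, window=2, max_n=5):
    """Collect all n-grams (n = 1..max_n) that fit entirely inside the
    context window [i - window, i + window] around position i."""
    lo, hi = max(0, i - window), min(len(word), i + window + 1)
    span = word[lo:hi]
    feats = []
    for n in range(1, max_n + 1):
        for start in range(len(span) - n + 1):
            # encode each n-gram with its offset relative to position i
            feats.append(f"{n}gram@{lo + start - i}={span[start:start + n]}")
    return feats

# Features around the 'l' in "table" (offsets are relative to i).
print(window_ngrams("table", i=3))
```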
Syllabification with Structured SVMs
As is apparent from Figure 2, we see a substantial improvement by adding bigrams to our feature set.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Huang, Liang
Experiments
Local instances: Rule 10,851; Word 20,328; WordEdges 454,101; CoLenPar 22; Bigram<> 10,292; Trigram<> 24,677; HeadMod<> 12,047; DistMod<> 16,017
NonLocal instances: ParentRule 18,019; WProj 27,417; Heads 70,013; HeadTree 67,836; Heavy 1,401; NGramTree 67,559; RightBranch 2
Total Feature Instances: 800,582
Experiments
We also restricted NGramTree to be on bigrams only.
Forest Reranking
2(d) returns the minimum tree fragment spanning a bigram, in this case “saw” and “the”, and should thus be computed at the smallest common ancestor of the two, which is the VP node in this example.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Johnson, Mark
Conclusion and future work
We then investigated adaptor grammars that incorporate one additional kind of information, and found that modeling collocations provides the greatest improvement in word segmentation accuracy, resulting in a model that seems to capture many of the same interword dependencies as the bigram model of Goldwater et al.
Word segmentation with adaptor grammars
It is not possible to write an adaptor grammar that directly implements Goldwater’s bigram word segmentation model because an adaptor grammar has one DP per adapted nonterminal (so the number of DPs is fixed in advance) while Goldwater’s bigram model has one DP per word type, and the number of word types is not known in advance.
Word segmentation with adaptor grammars
This suggests that the collocation word adaptor grammar can capture inter-word dependencies similar to those that improve the performance of Goldwater’s bigram segmentation model.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Koo, Terry and Carreras, Xavier and Collins, Michael
Background 2.1 Dependency parsing
The algorithm then repeatedly merges the pair of clusters which causes the smallest decrease in the likelihood of the text corpus, according to a class-based bigram language model defined on the word clusters.
Conclusions
To begin, recall that the Brown clustering algorithm is based on a bigram language model.
Feature design
(2005a), and consists of indicator functions for combinations of words and parts of speech for the head and modifier of each dependency, as well as certain contextual tokens. Our second-order baseline features are the same as those of Carreras (2007) and include indicators for triples of part of speech tags for sibling interactions and grandparent interactions, as well as additional bigram features based on pairs of words involved in these higher-order interactions.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Laskowski, Kornel
Experiments
Excluding qt−1 = qt bigrams (leading to 0.32M frames from 2.39M frames in “all”) offers a glimpse of expected performance differences were duration modeling to be included in the models.
Limitations and Desiderata
To produce Figures 1 and 2, a small fraction of probability mass was reserved for unseen bigram transitions (as opposed to backing off to unigram probabilities).
The Extended-Degree-of-Overlap Model
The EDO model mitigates R-specificity because it models each bigram (qt−1, qt) = (Si, Sj) as the modified bigram (m, [oij, nj]), involving three scalars, each of which is a sum, a commutative (and therefore rotation-invariant) operation.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Shutova, Ekaterina
Automatic Metaphor Recognition
They use the hyponymy relation in WordNet and word bigram counts to predict metaphors at the sentence level.
Automatic Metaphor Recognition
Hereby they calculate bigram probabilities of verb-noun and adjective-noun pairs (including the hyponyms/hypernyms of the noun in question).
Automatic Metaphor Recognition
However, by using bigram counts over verb-noun pairs, Krishnakumaran and Zhu (2007) lose a great deal of information compared to a system extracting verb-object relations from parsed text.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Clifton, Ann and Sarkar, Anoop
Models 2.1 Baseline Models
After CRF-based recovery of the suffix tag sequence, we use a bigram language model trained on a fully segmented version of the training data to recover the original vowels.
Models 2.1 Baseline Models
We used bigrams only, because the suffix vowel harmony alternation depends only upon the preceding phonemes in the word from which it was segmented.
Models 2.1 Baseline Models
Figure (a), Training: original training data "koskevaa mietintoa kasitellaan"; segmentation: koske+ +va+ +a mietinto+ +a kasi+ +te+ +lla+ +a+ +n (train a bigram language model with the mapping A = {a, ä}); map the final suffix to the abstract tag set: koske+ +va+ +A mietinto+ +A kasi+ +te+ +lla+ +a+ +n (train a CRF model to predict the final suffix); peel off the final suffix: koske+ +va+ mietinto+ kasi+ +te+ +lla+ +a+ (train the SMT model on this transformation of the training data).
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Huang, Zhiheng and Chang, Yi and Long, Bo and Crespo, Jean-Francois and Dong, Anlei and Keerthi, Sathiya and Wu, Su-Lin
Experiments
In particular, we use the unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for POS, CoNLL2000 and CoNLL 2003 datasets.
Experiments
For the supertag dataset, we use the same features for the word inputs, and the unigrams and bigrams for gold POS inputs.
Problem formulation
Bigram features are of the form fk(yt, yt−1, xt), which are concerned with both the previous and the current labels.
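A hedged sketch of indicator-style bigram feature functions fk(yt, yt−1, xt); the specific templates are illustrative, not the feature set used in the paper.

```python
def bigram_features(y_prev, y_curr, x_t):
    """Indicator-style bigram features f_k(y_t, y_{t-1}, x_t): each fires on
    the conjunction of the previous label, current label, and an observation."""
    feats = {
        f"trans={y_prev}->{y_curr}": 1.0,                               # pure tag-bigram feature
        f"trans+word={y_prev}->{y_curr}|w={x_t['word']}": 1.0,          # conjoined with the word
    }
    if x_t["word"][0].isupper():
        feats[f"trans+cap={y_prev}->{y_curr}"] = 1.0                    # capitalization conjunction
    return feats

print(bigram_features("O", "B-PER", {"word": "Obama"}))
```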
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Li, Fangtao and Gao, Yang and Zhou, Shuchang and Si, Xiance and Dai, Decheng
Experiments
Besides unigram and bigram , the most effective textual feature is URL.
Proposed Features
3.1.1 Unigrams and Bigrams: the most common type of feature for text classification
Proposed Features
feature selection method χ2 (Yang and Pedersen, 1997) to select the top 200 unigrams and bigrams as features.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Iyyer, Mohit and Enns, Peter and Boyd-Graber, Jordan and Resnik, Philip
Datasets
Their features come from the Linguistic Inquiry and Word Count lexicon (LIWC) (Pennebaker et al., 2001), as well as from lists of “sticky bigrams” (Brown et al., 1992) strongly associated with one party or another (e. g., “illegal aliens” implies conservative, “universal healthcare” implies liberal).
Datasets
We first extract the subset of sentences that contains any words in the LIWC categories of Negative Emotion, Positive Emotion, Causation, Anger, and Kill verbs. After computing a list of the top 100 sticky bigrams for each category, ranked by log-likelihood ratio, and selecting another subset from the original data that included only sentences containing at least one sticky bigram, we take the union of the two subsets.
Related Work
They use an HMM-based model, defining the states as a set of fine-grained political ideologies, and rely on a closed set of lexical bigram features associated with each ideology, inferred from a manually labeled ideological books corpus.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Konstas, Ioannis and Lapata, Mirella
Experimental Design
Consecutive Word/Bigram/Trigram: this feature family targets adjacent repetitions of the same word, bigram, or trigram, e.g., ‘show me the show me the’.
Problem Formulation
The weight of this rule is the bigram probability of two records conditioned on their type, multiplied by a normalization factor λ.
Problem Formulation
Rule (6) defines the expansion of field F to a sequence of (binarized) words W, with a weight equal to the bigram probability of the current word given the previous word, the current record, and field.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Sun, Weiwei and Wan, Xiaojun
Structure-based Stacking
• Character unigrams: ck (i − l ≤ k ≤ i + l)
• Character bigrams: ck ck+1 (i − l ≤ k < i + l)
Structure-based Stacking
• Character label bigrams: c^ppd_k c^ppd_{k+1} (i − l_ppd ≤
Structure-based Stacking
• Bigram features: C(sk)C(sk+1) (i − l_C ≤ k < i + l_C), Tctb(sk)Tctb(sk+1) (i − l_ctb ≤ k < i + l_ctb), Tppd(sk)Tppd(sk+1) (i − l_ppd ≤ k < i + l_ppd)
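A rough sketch of such stacked bigram features, combining character bigrams with label bigrams from two auxiliary taggers inside separate windows; the window sizes and the label sequences are placeholders, not the paper's actual configuration.

```python
def stacking_features(chars, ctb_tags, ppd_tags, i, l_c=1, l_ctb=1, l_ppd=1):
    """Character bigrams plus label bigrams from two auxiliary taggers,
    each taken inside its own window around position i (a rough sketch of
    structure-based stacking; window sizes here are arbitrary)."""
    feats = []
    for k in range(max(0, i - l_c), min(len(chars) - 1, i + l_c)):
        feats.append(f"char_bi@{k - i}={chars[k]}{chars[k + 1]}")
    for k in range(max(0, i - l_ctb), min(len(ctb_tags) - 1, i + l_ctb)):
        feats.append(f"ctb_bi@{k - i}={ctb_tags[k]}|{ctb_tags[k + 1]}")
    for k in range(max(0, i - l_ppd), min(len(ppd_tags) - 1, i + l_ppd)):
        feats.append(f"ppd_bi@{k - i}={ppd_tags[k]}|{ppd_tags[k + 1]}")
    return feats

chars = list("ABCDE")              # placeholder characters
ctb = ["B", "E", "B", "M", "E"]    # placeholder CTB-style labels
ppd = ["S", "B", "E", "B", "E"]    # placeholder PPD-style labels
print(stacking_features(chars, ctb, ppd, i=2))
```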
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Almeida, Miguel and Martins, Andre
Compressive Summarization
(2011), we used stemmed word bigrams as concepts, to which we associate the following concept features: indicators for document counts, features indicating whether each of the words in the bigram is a stop word, the earliest position in a document at which each concept occurs, as well as two- and three-way conjunctions of these features.
Experiments
We generated oracle extracts by maximizing bigram recall with respect to the manual abstracts, as described in Berg-Kirkpatrick et al.
Extractive Summarization
Previous work has modeled concepts as events (Filatova and Hatzivassiloglou, 2004), salient words (Lin and Bilmes, 2010), and word bigrams (Gillick et al., 2008).
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Weller, Marion and Fraser, Alexander and Schulte im Walde, Sabine
Using subcategorization information
The model has access to the basic features stem and tag, as well as the new features based on subcategorization information (explained below), using unigrams within a window of up to four positions to the right and the left of the current position, as well as bigrams and trigrams for stems and tags (current item + left and/or right item).
Using subcategorization information
In addition to the probability/frequency of the respective functions, we also provide the CRF with bigrams containing the two parts of the tuple,
Using subcategorization information
By providing the parts of the tuple as unigrams, bigrams or trigrams to the CRF, all relevant information is available: verb, noun and the probabilities for the potential functions of the noun in the sentence.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Hagiwara, Masato and Sekine, Satoshi
Introduction
Knowing that the back-transliterated unigram “blacki” and bigram “blacki shred” are unlikely in English can promote the correct WS, ブラキッシュ レッド “blackish red”.
Use of Language Model
As the English LM, we used Google Web 1T 5-gram Version 1 (Brants and Franz, 2006), limiting it to unigrams occurring more than 2000 times and bigrams occurring more than 500 times.
Word Segmentation Model
We limit the features to word unigram and bigram features, i.e., φ(y) = Σi [φ1(wi) + φ2(wi−1, wi)] for y = w1...wn.
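A minimal sketch of this unigram-plus-bigram feature map for a candidate segmentation y = w1...wn, scored under hypothetical weights; the weight values and candidate segmentations are invented for illustration.

```python
from collections import Counter

def segmentation_features(words):
    """Global feature vector for a candidate segmentation y = w_1 ... w_n:
    the sum of word unigram features phi1(w_i) and bigram features
    phi2(w_{i-1}, w_i)."""
    feats = Counter()
    prev = "<s>"
    for w in words:
        feats[f"uni={w}"] += 1            # phi1(w_i)
        feats[f"bi={prev}|{w}"] += 1      # phi2(w_{i-1}, w_i)
        prev = w
    return feats

def score(words, weights):
    feats = segmentation_features(words)
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

# Toy candidate segmentations scored under hypothetical weights.
weights = {"uni=blackish": 1.0, "bi=blackish|red": 2.0, "uni=blacki": -1.0}
print(score(["blackish", "red"], weights), score(["blacki", "shred"], weights))
```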
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Wang, Lu and Raghavan, Hema and Castelli, Vittorio and Florian, Radu and Cardie, Claire
Results
The results in Table 5 use the official ROUGE software with standard options and report ROUGE-2 (R-2) (measures bigram overlap) and ROUGE-SU4 (R-SU4) (measures unigram and skip-bigram separated by up to four words).
The Framework
unigram/bigram/skip-bigram (at most four words apart) overlap; unigram/bigram TF/TF-IDF similarity
The Framework
2011; Ouyang et al., 2011), we use the ROUGE-2 score, which measures bigram overlap between a sentence and the abstracts, as the objective for regression.
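A simplified sketch of ROUGE-2 recall as clipped bigram overlap; the official ROUGE toolkit adds stemming and other options not reproduced here, so this is only an approximation of the metric used as the regression objective.

```python
from collections import Counter

def rouge_2_recall(candidate, references):
    """Simplified ROUGE-2 recall: clipped bigram matches over the total
    number of bigrams in the reference summaries (no stemming or stopword
    handling, unlike the official ROUGE toolkit)."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand = bigrams(candidate)
    matched, total = 0, 0
    for ref in references:
        ref_bi = bigrams(ref)
        total += sum(ref_bi.values())
        matched += sum(min(c, cand[bg]) for bg, c in ref_bi.items())
    return matched / total if total else 0.0

cand = "the cat sat on the mat".split()
refs = ["the cat was on the mat".split()]
print(rouge_2_recall(cand, refs))  # 3 of 5 reference bigrams matched
```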
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Roark, Brian and Allauzen, Cyril and Riley, Michael
Introduction
For example, if a bigram parameter is modified due to the presence of some set of trigrams, and then some or all of those trigrams are pruned from the model, the bigram associated with the modified parameter will be unlikely to have an overall expected frequency equal to its observed frequency anymore.
Marginal distribution constraints
Thus the unigram distribution is with respect to the bigram model, the bigram model is with respect to the trigram model, and so forth.
Model constraint algorithm
This can be particularly clearly seen at the unigram state, which has an arc for every unigram (the size of the vocabulary): for every bigram state (also on the order of the vocabulary), in the naive algorithm we must look for every possible arc.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Mukherjee, Arjun and Liu, Bing
Phrase Ranking based on Relevance
This thread of research models bigrams by encoding them into the generative process.
Phrase Ranking based on Relevance
For each word, a topic is sampled first, then its status as a unigram or bigram is sampled, and finally the word is sampled from a topic-specific unigram or bigram distribution.
Phrase Ranking based on Relevance
In (Tomokiyo and Hurst, 2003), a language model approach is used for bigram phrase extraction.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper:
Kozareva, Zornitsa
Task A: Polarity Classification
We studied the influence of unigrams, bigrams and a combination of the two, and saw that the best performing feature set consists of the combination of unigrams and bigrams .
Task A: Polarity Classification
In this paper, we will refer from now on to n-grams as the combination of unigrams and bigrams .
Task B: Valence Prediction
Those include n-grams (unigrams, bigrams and a combination of the two) and LIWC scores.
bigram is mentioned in 3 sentences in this paper.
Topics mentioned in this paper: