Abstract | In this paper, we propose a bigram-based supervised method for extractive document summarization in the integer linear programming (ILP) framework. |
Abstract | For each bigram, a regression model is used to estimate its frequency in the reference summary. |
Abstract | The regression model uses a variety of indicative features and is trained discriminatively to minimize the distance between the estimated and the ground truth bigram frequency in the reference summary. |
Introduction | They used bigrams as such language concepts. |
Introduction | Gillick and Favre (2009) used bigrams as concepts, which are selected from a subset of the sentences, and their document frequency as the weight in the objective function. |
Introduction | In this paper, we propose to find a candidate summary such that the language concepts (e.g., bigrams) in this candidate summary and the reference summary can have the same frequency. |
Proposed Method 2.1 Bigram Gain Maximization by ILP | We choose bigrams as the language concepts in our proposed method since they have been successfully used in previous work. |
Proposed Method 2.1 Bigram Gain Maximization by ILP | In addition, we expect that the bigram oriented ILP is consistent with the ROUGE-2 measure widely used for summarization evaluation. |
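The bigram-recall quantity underlying ROUGE-2 can be sketched in a few lines. This is a toy illustration of the measure the snippet refers to, not the official evaluation toolkit:

```python
from collections import Counter

def bigrams(tokens):
    """Multiset of adjacent word pairs in a token list."""
    return Counter(zip(tokens, tokens[1:]))

def bigram_recall(candidate, reference):
    """Clipped bigram recall, the quantity ROUGE-2 is built on."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum(min(cand[b], ref[b]) for b in ref)
    return overlap / max(1, sum(ref.values()))

cand = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
print(bigram_recall(cand, ref))  # 3 of the 5 reference bigrams matched
```

A summarizer that maximizes estimated bigram gain is thus directly optimizing toward this overlap.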
Abstract | Evaluated on the WSJ corpus, bigram and trigram model perplexities were reduced by up to 23.5% and 14.0%, respectively. |
Abstract | Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence. |
Language Modeling with TD and TO | The prior, which is usually implemented as a unigram model, can also be replaced with a higher order n-gram model as, for instance, the bigram model: |
Perplexity Evaluation | As seen from the table, for lower-order n-gram models, the complementary information captured by the TD and TO components reduced the perplexity up to 23.5% and 14.0%, for bigram and trigram models, respectively. |
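Perplexity, the quantity being reduced here, is the exponential of the negative average per-token log-probability. A minimal sketch with a toy uniform model (not the paper's TD/TO components):

```python
from math import exp, log

def perplexity(tokens, logprob):
    """Per-token perplexity under a bigram log-probability model;
    lower is better. The 23.5% / 14.0% figures in the text are
    relative drops in this quantity."""
    toks = ["<s>"] + list(tokens) + ["</s>"]
    lp = sum(logprob(a, b) for a, b in zip(toks, toks[1:]))
    return exp(-lp / (len(toks) - 1))

# Toy model: every continuation equally likely over a 10-word vocabulary,
# so the perplexity should come out as exactly the vocabulary size.
uniform = lambda a, b: log(1 / 10)
print(round(perplexity("a b c".split(), uniform)))  # → 10
```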
Perplexity Evaluation | better modeling of word-pairs compared to the distant bigram model. |
Perplexity Evaluation | Here we compare the perplexity of both the distance-k bigram model and the distance-k TD model (for values of k ranging from two to ten), when combined with a standard bigram model. |
Related Work | The distant bigram model (Huang et al., 1993; Simon et al., |
Related Work | 2007) disassembles the n-gram into (n−1) word-pairs, such that each pair is modeled by a distance-k bigram model, where 1 ≤ k ≤ n − 1. |
Related Work | Each distance-k bigram model predicts the target-word based on the occurrence of a history-word located k positions behind. |
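The distance-k prediction described above can be sketched directly: gather counts of (history, target) pairs exactly k positions apart, then take the MLE conditional. A toy illustration, not the cited systems:

```python
from collections import Counter

def distance_k_counts(tokens, k):
    """Count (history, target) pairs where the history word sits
    exactly k positions before the target word."""
    return Counter((tokens[i - k], tokens[i]) for i in range(k, len(tokens)))

def distance_k_prob(counts, history, target):
    """MLE estimate of P_k(target | history) from distance-k counts."""
    total = sum(c for (h, _), c in counts.items() if h == history)
    return counts[(history, target)] / total if total else 0.0

toks = "a b a b a c".split()
c2 = distance_k_counts(toks, 2)   # pairs two positions apart
print(c2[("a", "a")])             # 'a' recurs at distance 2 twice here
```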
Abstract | We find that Bigram dependencies are important for performing well on real data and for learning appropriate deletion probabilities for different contexts.1 |
Experiments 4.1 The data | Best performance for both the Unigram and the Bigram model in the GOLD-p condition is achieved under the left-right setting, in line with the standard analyses of /t/-deletion as primarily being determined by the preceding and the following context. |
Experiments 4.1 The data | For the LEARN-p condition, the Bigram model still performs best in the left-right setting but the Unigram model’s performance drops |
Experiments 4.1 The data | Note how the Unigram model always suffers in the LEARN-p condition whereas the Bigram model’s performance is actually best for LEARN-p in the left-right setting. |
Introduction | We find that models that capture bigram dependencies between underlying forms provide considerably more accurate estimates of those probabilities than corresponding unigram or “bag of words” models of underlying forms. |
The computational model | Our models build on the Unigram and the Bigram model introduced in Goldwater et al. |
The computational model | Figure 1 shows the graphical model for our joint Bigram model (the Unigram case is trivially recovered by generating the U_{i,j} directly from L rather than from L_{U_{i,j−1}}). |
The computational model | Figure 1: The graphical model for our joint model of word-final /t/-deletion and Bigram word segmentation. |
Models for Measuring Grammatical Competence | Then, regarding POS bigrams as terms, they construct POS-based vector space models for each score-class (there are four score classes denoting levels of proficiency as will be explained in Section 5.2), thus yielding four score-specific vector-space models (VSMs). |
Models for Measuring Grammatical Competence | • cos4: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and, |
Models for Measuring Grammatical Competence | First, the VSM-based method is likely to overestimate the contribution of the POS bigrams when highly correlated bigrams occur as terms in the VSM. |
Related Work | In order to avoid the problems encountered with deep analysis-based measures, Yoon and Bhat (2012) explored a shallow analysis-based approach, based on the assumption that the level of grammar sophistication at each proficiency level is reflected in the distribution of part-of-speech (POS) tag bigrams. |
Shallow-analysis approach to measuring syntactic complexity | The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence. |
Shallow-analysis approach to measuring syntactic complexity | In a shallow-analysis approach to measuring syntactic complexity, we rely on the distribution of POS bigrams at every proficiency level. |
Shallow-analysis approach to measuring syntactic complexity | Consider the two sentence fragments below taken from actual responses (the bigrams of interest and their associated POS tags are boldfaced). |
Distribution Prediction | For this purpose, we represent a word w using unigrams and bigrams that co-occur with w in a sentence as follows. |
Distribution Prediction | Next, we generate bigrams of word lemmas and remove any bigrams that consist only of stop words. |
Distribution Prediction | Bigram features capture negations more accurately than unigrams, and have been found to be useful for sentiment classification tasks. |
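The point about negation can be made concrete: with unigram features alone, "not good" decomposes into "not" and "good", but a bigram feature keeps the negation attached. A toy sketch, not the paper's feature extractor:

```python
def unigram_bigram_features(tokens):
    """Unigram and bigram string features; a bigram such as 'not_good'
    keeps the negation attached to the word it modifies."""
    unigrams = list(tokens)
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams

feats = unigram_bigram_features("this is not good".split())
print("not_good" in feats)  # the negated sentiment survives as one feature
```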
Experimental Setup | 256,873 unique unigrams and 4,494,222 unique bigrams. |
Experimental Setup | We cluster unigrams (i = 1) and bigrams (i = 2). |
Experimental Setup | SRILM does not directly support bigram clustering. |
Models | The parameters d’, d”, and d’” are the discounts for unigrams, bigrams and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)). |
Models | bigram) histories that is covered by the clusters. |
Models | We cluster bigram histories and unigram histories separately and write pB(w3|w1w2) for the bigram cluster model and pB(w3|w2) for the unigram cluster model. |
Experiment | Model PKU MSRA: Best05 (Chen et al., 2005) 95.0 96.0; Best05 (Tseng et al., 2005) 95.0 96.4; (Zhang et al., 2006) 95.1 97.1; (Zhang and Clark, 2007) 94.5 97.2; (Sun et al., 2009) 95.2 97.3; (Sun et al., 2012) 95.4 97.4; (Zhang et al., 2013) 96.1 97.4; MMTNN 94.0 94.9; MMTNN + bigram 95.2 97.2 |
Experiment | A very common feature in Chinese word segmentation is the character bigram feature. |
Experiment | Formally, at the i-th character of a sentence c_{1:n}, the bigram features are c_k c_{k+1} (i − 3 < k < i + 2). |
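The window above means k ranges over {i−2, i−1, i, i+1}. A minimal sketch of this feature extraction, with padding at sentence edges (the pad symbol is my own assumption, not from the paper):

```python
def char_bigram_features(chars, i, pad="#"):
    """Character bigram features c_k c_{k+1} for k in {i-2, i-1, i, i+1},
    mirroring the window in the text; out-of-range positions are padded."""
    def at(j):
        return chars[j] if 0 <= j < len(chars) else pad
    return [at(k) + at(k + 1) for k in range(i - 2, i + 2)]

print(char_bigram_features(list("ABCDE"), 2))  # → ['AB', 'BC', 'CD', 'DE']
```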
Introduction | Therefore, we integrate additional simple character bigram features into our model and the result shows that our model can achieve a competitive performance that other systems hardly achieve unless they use more complex task-specific features. |
Related Work | Most previous systems address this task by using linear statistical models with carefully designed features such as bigram features, punctuation information (Li and Sun, 2009) and statistical information (Sun and Xu, 2011). |
A Motivating Example | (bigrams) survey+development, development+civilization |
Feature Expansion | , w_N}, where the elements w_i are either unigrams or bigrams that appear in the review d. We then represent a review d by a real-valued term-frequency vector d ∈ R^N, where the value of the j-th element d_j is set to the total number of occurrences of the unigram or bigram w_j in the review d. To find suitable candidates to expand a vector d for the review d, we define a ranking score score(u_i, d) for each base entry in the thesaurus as follows: |
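The term-frequency representation described here is easy to sketch as a sparse vector keyed by lexical elements (a toy stand-in for the paper's d ∈ R^N, using my own tokenization):

```python
from collections import Counter

def review_vector(tokens):
    """Term-frequency vector over the unigrams and bigrams of a review,
    as a sparse dict; the value for element w_j is its count in the review."""
    elements = list(tokens) + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return dict(Counter(elements))

d = review_vector("an excellent and excellent film".split())
print(d["excellent"])  # unigram count in the review
```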
Feature Expansion | Moreover, we weight the relatedness scores for each word wj by its normalized term-frequency to emphasize the salient unigrams and bigrams in a review. |
Feature Expansion | This is particularly important because we would like to score base entries u_i considering all the unigrams and bigrams that appear in a review d, instead of considering each unigram or bigram individually. |
Introduction | a unigram or a bigram of word lemma) in a review using a feature vector. |
Sentiment Sensitive Thesaurus | We select unigrams and bigrams from each sentence. |
Sentiment Sensitive Thesaurus | For the remainder of this paper, we will refer to unigrams and bigrams collectively as lexical elements. |
Sentiment Sensitive Thesaurus | Previous work on sentiment classification has shown that both unigrams and bigrams are useful for training a sentiment classifier (Blitzer et al., 2007). |
Background | This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams and unigrams and whole words and their character level representation. |
Experiments | mkcls (Och, 1999) 73.7 65.6; MLE 1HMM-LM (Clark, 2003)* 71.2 65.5; BHMM (GG07) 63.2 56.2; PR (Ganchev et al., 2010)* 62.5 54.8; Trigram PYP-HMM 69.8 62.6; Trigram PYP-1HMM 76.0 68.0; Trigram PYP-1HMM-LM 77.5 69.7; Bigram PYP-HMM 66.9 59.2; Bigram PYP-1HMM 72.9 65.9; Trigram DP-HMM 68.1 60.0; Trigram DP-1HMM 76.0 68.0; Trigram DP-1HMM-LM 76.8 69.8 |
Experiments | If we restrict the model to bigrams we see a considerable drop in performance. |
The PYP-HMM | The trigram transition distribution, T_{ij}, is drawn from a hierarchical PYP prior which backs off to a bigram B_j and then a unigram U distribution, |
The PYP-HMM | This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions. |
The PYP-HMM | We formulate the character-level language model as a bigram model over the character sequence comprising word w, |
Joint Model | While past extractive methods have assigned value to individual sentences and then explicitly represented the notion of redundancy (Carbonell and Goldstein, 1998), recent methods show greater success by using a simpler notion of coverage: bigrams |
Joint Model | Note that there is intentionally a bigram missing from (a). |
Joint Model | contribute content, and redundancy is implicitly encoded in the fact that redundant sentences cover fewer bigrams (Nenkova and Vanderwende, 2005; Gillick and Favre, 2009). |
Structured Learning | We use bigram recall as our loss function (see Section 3.3). |
Structured Learning | Luckily, our choice of loss function, bigram recall, factors over bigrams. |
Structured Learning | We simply modify each bigram value v_b to include bigram b’s contribution to the total loss. |
Abstract | In this paper we show that even for the case of 1:1 substitution ciphers—which encipher plaintext symbols by exchanging them with a unique substitute—finding the optimal decipherment with respect to a bigram language model is NP-hard. |
Definitions | similarly define the bigram count N_{ff'} of f, f' ∈ V_f as |
Definitions | (a) N_{ff'} are integer counts ≥ 0 of bigrams found in the ciphertext f_1^N. |
Definitions | (b) Given the first and last token of the cipher f1 and fN, the bigram counts involving the sentence boundary token $ need to fulfill |
Introduction | Section 5 shows the connection between the quadratic assignment problem and decipherment using a bigram language model. |
Introduction | Most methods have employed some variant of Expectation Maximization (EM) to learn parameters for a bigram |
Introduction | Ravi and Knight (2009) achieved the best results thus far (92.3% word token accuracy) via a Minimum Description Length approach using an integer program (IP) that finds a minimal bigram grammar that obeys the tag dictionary constraints and covers the observed data. |
Minimized models for supertagging | The 1241 distinct supertags in the tagset result in 1.5 million tag bigram entries in the model and the dictionary contains almost 3.5 million word/tag pairs that are relevant to the test data. |
Minimized models for supertagging | The set of 45 POS tags for the same data yields 2025 tag bigrams and 8910 dictionary entries. |
Minimized models for supertagging | Our objective is to find the smallest supertag grammar (of tag bigram types) that explains the entire text while obeying the lexicon’s constraints. |
Ad hoc rule detection | 3.4 Bigram anomalies; 3.4.1 Motivation |
Ad hoc rule detection | The bigram method examines relationships between adjacent sisters, complementing the whole rule method by focusing on local properties. |
Ad hoc rule detection | But only the final elements have anomalous bigrams: HD:ID IR:IR, IR:IR AN:RO, and AN:RO JR:IR all never occur. |
Additional information | This rule is entirely correct, yet the XX:XX position has low whole rule and bigram scores. |
Approach | First, the bigram method abstracts a rule to its bigrams. |
Evaluation | For example, the bigram method with a threshold of 39 leads to finding 283 errors (455 × .622). |
Evaluation | The whole rule and bigram methods reveal greater precision in identifying problematic dependencies, isolating elements with lower UAS and LAS scores than with frequency, along with correspondingly greater precision. |
Introduction and Motivation | We propose to flag erroneous parse rules, using information which reflects different grammatical properties: POS lookup, bigram information, and full rule comparisons. |
Fitting the Model | We still give EM the full word/tag dictionary, but now we constrain its initial grammar model to the 459 tag bigrams identified by IP. |
Fitting the Model | In addition to removing many bad tag bigrams from the grammar, IP minimization also removes some of the good ones, leading to lower recall (EM = 0.87, IP+EM = 0.57). |
Fitting the Model | During EM training, the smaller grammar with fewer bad tag bigrams helps to restrict the dictionary model from making too many bad choices that EM made earlier. |
Small Models | That is, there exists a tag sequence that contains 459 distinct tag bigrams, and no other tag sequence contains fewer. |
Small Models | We also create variables for every possible tag bigram and word/tag dictionary entry. |
What goes wrong with EM? | We investigate the Viterbi tag sequence generated by EM training and count how many distinct tag bigrams there are in that sequence. |
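Counting the distinct tag bigrams in a decoded sequence, the grammar-size statistic this line refers to, is a one-liner. A toy sketch, not the paper's pipeline:

```python
def distinct_tag_bigrams(tags):
    """Number of distinct adjacent tag pairs in a predicted tag sequence;
    the statistic the minimal-grammar objective drives down."""
    return len(set(zip(tags, tags[1:])))

# (DT,NN) occurs twice but is counted once, so this yields 3.
print(distinct_tag_bigrams(["DT", "NN", "VB", "DT", "NN"]))  # → 3
```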
Evaluation | In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | Using unigrams (“SLP 1-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams. |
Generation & Propagation | Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings. |
Generation & Propagation | We only consider target phrases whose source phrase is a bigram, but it is worth noting that the target phrases are of variable length. |
Generation & Propagation | To generate new translation candidates using the baseline system, we decode each unlabeled source bigram to generate its m-best translations. |
Multi-Structure Sentence Compression | Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies. |
Multi-Structure Sentence Compression | C. In addition, we define bigram indicator variables y_ij ∈ {0, 1} to represent whether a particular order-preserving bigram ⟨t_i, t_j⟩ from S is present as a contiguous bigram in C, as well as dependency indicator variables z_ij ∈ {0, 1} corresponding to whether the dependency arc t_i → t_j is present in the dependency parse of C. The score for a given compression C can now be defined to factor over its tokens, n-grams and dependencies as follows. |
Multi-Structure Sentence Compression | where θ_tok, θ_ngr and θ_dep are feature-based scoring functions for tokens, bigrams and dependencies respectively. |
Abstract | The selection is made according to the appropriateness of the alteration to the query context (using a bigram language model), or according to its expected impact on the retrieval effectiveness (using a regression model). |
Bigram Expansion Model for Alteration Selection | The query context is modeled by a bigram language model as in (Peng et al. |
Bigram Expansion Model for Alteration Selection | In this work, we used a bigram language model to calculate the probability of each path. |
Introduction | The query context is modeled by a bigram language model. |
Introduction | which is the most coherent with the bigram model. |
Introduction | We call this model Bigram Expansion. |
Related Work | 2007), a bigram language model is used to determine the alteration of the head word that best fits the query. |
Related Work | In this paper, one of the proposed methods will also use a bigram language model of the query to determine the appropriate alteration candidates. |
Abstract | We take a multi-pass approach to machine translation decoding when using synchronous context-free grammars as the translation model and n-gram language models: the first pass uses a bigram language model, and the resulting parse forest is used in the second pass to guide search with a trigram language model. |
Abstract | The trigram pass closes most of the performance gap between a bigram decoder and a much slower trigram decoder, but takes time that is insignificant in comparison to the bigram pass. |
Decoding to Maximize BLEU | The outside-pass Algorithm 1 for bigram decoding can be generalized to the trigram case. |
Experiments | Hyperedges, BLEU: Bigram Pass 167K, 21.77; Trigram Pass UNI−BO +629.7K = 796.7K, 23.56; BO+BB +2.7K = 169. |
Introduction | First, we present a two-pass decoding algorithm, in which the first pass explores states resulting from an integrated bigram language model, and the second pass expands these states into trigram-based |
Introduction | With this heuristic, we achieve the same BLEU scores and model cost as a trigram decoder with essentially the same speed as a bigram decoder. |
Multi-pass LM-Integrated Decoding | More specifically, a bigram decoding pass is executed forward and backward to figure out the probability of each state. |
Multi-pass LM-Integrated Decoding | We take the same view as in speech recognition that a trigram integrated model is a finer-grained model than a bigram model, and in general we can do an (n−1)-gram decoding as a predictive pass for the following n-gram pass. |
Multi-pass LM-Integrated Decoding | If we can afford a bigram decoding pass, the outside cost from a bigram model is conceivably a |
Evaluation | The results are shown in figure 1 for the whole daughters scoring method and in figure 2 for the bigram method. |
Evaluation | Figure 2: Bigram ungeneralizability (dev.) |
Rule dissimilarity and generalizability | To do this, we can examine the weakest parts of each rule and compare those across the corpus, to see which anomalous patterns emerge; we do this in the Bigram scoring section below. |
Rule dissimilarity and generalizability | Bigram scoring The other method of detecting ad hoc rules calculates reliability scores by focusing specifically on what the classes do not have in common. |
Rule dissimilarity and generalizability | We abstract to bigrams, including added START and END tags, as longer sequences risk missing generalizations; e.g., unary rules would have no comparable rules. |
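The abstraction described here, padding a rule's daughter sequence with START and END and taking adjacent pairs, can be sketched directly (a toy illustration, not the authors' implementation):

```python
def rule_bigrams(daughters):
    """Abstract a rule's daughter sequence to its bigrams, with added
    START and END tags so peripheral positions are also checked."""
    seq = ["START"] + list(daughters) + ["END"]
    return list(zip(seq, seq[1:]))

print(rule_bigrams(["SB", "HD", "OC"]))
# [('START', 'SB'), ('SB', 'HD'), ('HD', 'OC'), ('OC', 'END')]
```

Because even a unary rule yields the two bigrams (START, X) and (X, END), every rule has something comparable across the corpus.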
Distributed Clustering | With each word w in one of these sets, all words v preceding w in the corpus are stored with the respective bigram count N(v, w). |
Distributed Clustering | While the greedy non-distributed exchange algorithm is guaranteed to converge as each exchange increases the log likelihood of the assumed bigram model, this is not necessarily true for the distributed exchange algorithm. |
Exchange Clustering | Beginning with an initial clustering, the algorithm greedily maximizes the log likelihood of a two-sided class bigram or trigram model as described in Eq. |
Exchange Clustering | With N_c^pre and N_c^suc denoting the average number of clusters preceding and succeeding another cluster, B denoting the number of distinct bigrams in the training corpus, and I denoting the number of iterations, the worst case complexity of the algorithm is in: |
Exchange Clustering | When using large corpora with large numbers of bigrams the number of required updates can increase towards the quadratic upper bound as N_c^pre and N_c^suc approach N_C. |
Predictive Exchange Clustering | Modifying the exchange algorithm in order to optimize the log likelihood of a predictive class bigram model, leads to substantial performance improvements, similar to those previously reported for another type of one-sided class model in (Whittaker and Woodland, 2001). |
Predictive Exchange Clustering | We use a predictive class bigram model as given in Eq. |
Predictive Exchange Clustering | Then the following optimization criterion can be derived, with F(C) being the log likelihood function of the predictive class bigram model given a clustering C: |
Metric Design Considerations | Similarly, we also match the bigrams and trigrams of the sentence pair and calculate their corresponding Fmean scores. |
Metric Design Considerations | where in our experiments, we set N = 3, representing calculation of unigram, bigram, and trigram scores. |
Metric Design Considerations | To determine the number match_bi of bigram matches, a system bigram (ls_i ps_i, ls_{i+1} ps_{i+1}) matches a reference bigram (lr_i pr_i, lr_{i+1} pr_{i+1}). |
Experiments | Bigram and trigram performances are similar for Chinese, but trigram performs better for Japanese. |
Experiments | In fact, although the difference in perplexity per character is not so large, the perplexity per word is radically reduced: 439.8 (bigram) to 190.1 (trigram). |
Inference | Furthermore, it has an inherent limitation that it cannot deal with n-grams larger than bigrams, because it uses only local statistics between directly contiguous words for word segmentation. |
Inference | It has an additional advantage in that we can accommodate higher-order relationships than bigrams , particularly trigrams, for word segmentation. |
Inference | For this purpose, we maintain a forward variable α[t] in the bigram case. |
Introduction | Crucially, since they rely on sampling a word boundary between two neighboring words, they can leverage only up to bigram word dependencies. |
Pitman-Yor process and n-gram models | The bigram distribution G2 = { |
Experiments | We follow prior work and use sets of bigrams within words. |
Experiments | In our case, during bipartite matching the set X is the set of bigrams in the language being re-permuted, and Y is the union of bigrams in the other languages. |
Experiments | Besides the heuristic baseline, we tried our model-based approach using Unigrams, Bigrams and Anchored Unigrams, with and without learning the parametric edit distances. |
Message Approximation | Figure 2: Various topologies for approximating topologies: (a) a unigram model, (b) a bigram model, (c) the anchored unigram model, and (d) the n-best plus backoff model used in Dreyer and Eisner (2009). |
Message Approximation | The first is a plain unigram model, the second is a bigram model, and the third is an anchored unigram topology: a position-specific unigram model for each position up to some maximum length. |
Message Approximation | The second topology we consider is the bigram topology, illustrated in Figure 2(b). |
A Simple Lagrangian Relaxation Algorithm | We now give a Lagrangian relaxation algorithm for integration of a hypergraph with a bigram language model, in cases where the hypergraph satisfies the following simplifying assumption: |
A Simple Lagrangian Relaxation Algorithm | over the original (non-intersected) hypergraph, with leaf nodes having weights θ_v + … (3) If the output derivation from step 2 has the same set of bigrams as those from step 1, then we have an exact solution to the problem. |
A Simple Lagrangian Relaxation Algorithm | C1 states that each leaf in a derivation has exactly one incoming bigram, and that each leaf not in the derivation has 0 incoming bigrams; C2 states that each leaf in a derivation has exactly one outgoing bigram, and that each leaf not in the derivation has 0 outgoing bigrams. |
Background: Hypergraphs | Throughout this paper we make the following assumption when using a bigram language model: |
Background: Hypergraphs | Assumption 3.1 (Bigram start/end assumption). |
The Full Algorithm | The set P of trigram paths plays an analogous role to the set B of bigrams in our previous algorithm. |
Training method | We broadly classify features into two categories: unigram and bigram features. |
Training method | TB0 ⟨TE(w−1)⟩; TB1 ⟨TE(w−1), p0⟩; TB2 ⟨TE(w−1), p−1, p0⟩; TB3 ⟨TE(w−1), TB(w0)⟩; TB4 ⟨TE(w−1), TB(w0), p0⟩; TB5 ⟨TE(w−1), p−1, TB(w0)⟩; TB6 ⟨TE(w−1), p−1, TB(w0), p0⟩; CB0 ⟨p−1, p0⟩ otherwise. Table 3: Bigram features. |
Training method | Bigram features: Table 3 shows our bigram features. |
A Syntax Free Sequence-oriented Sentence Compression Method | 1 if l(y_j) = l(y_{j−1}) + 1; λ_PLM · Bigram(w_{l(y_j)}, w_{l(y_{j−1})}) otherwise (5) |
A Syntax Free Sequence-oriented Sentence Compression Method | Here, 0 ≤ λ_PLM ≤ 1, and Bigram(·) indicates word bigram probability. |
A Syntax Free Sequence-oriented Sentence Compression Method | The first line of equation (5) agrees with Jing’s observation on sentence alignment tasks (Jing and McKeown, 1999); that is, most (or almost all) bigrams in a compressed sentence appear in the original sentence as they are. |
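Jing's observation, that almost all bigrams of a good compression occur contiguously in the source, can be checked with a small containment ratio. A toy sketch, not the authors' system:

```python
def bigram_set(tokens):
    return set(zip(tokens, tokens[1:]))

def contiguous_bigram_ratio(compressed, source):
    """Fraction of the compression's bigrams that already occur as
    contiguous bigrams in the source sentence."""
    comp = bigram_set(compressed)
    return len(comp & bigram_set(source)) / max(1, len(comp))

src = "the old dog barked at the mailman".split()
cmp_ = "the dog barked at the mailman".split()
print(contiguous_bigram_ratio(cmp_, src))  # only ('the','dog') is new
```

Equation (5) rewards exactly this: copied adjacent pairs score 1, while newly formed pairs fall back to a discounted bigram probability.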
Experimental Evaluation | For example, label ‘w/o IPTW + Dep’ employs IDF term weighting as one function and word bigram, part-of-speech bigram and dependency probability between words as the other function in equation (1). |
Results and Discussion | Replacing PLM with the bigram language model (w/o PLM) degrades the performance significantly. |
Results and Discussion | Most bigrams in a compressed sentence followed those in the source sentence. |
Results and Discussion | PLM is similar to dependency probability in that both features emphasize word pairs that occurred as bigrams in the source sentence. |
Simulation 1 | Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con- |
Simulation 1 | From this vocabulary, we constructed a bigram model using the counts from every bigram in the BNC for which both words were in vocabulary (about 222,000 bigrams). |
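Building a bigram model from counts like this is the standard MLE construction: P(w2 | w1) = N(w1, w2) / N(w1). A minimal unsmoothed sketch over a toy corpus (a stand-in for BNC-scale counts, not the authors' model):

```python
from collections import Counter

def train_bigram_model(sentences):
    """Unsmoothed MLE bigram model: P(w2 | w1) = N(w1, w2) / N(w1),
    with sentence-boundary padding."""
    pair, hist = Counter(), Counter()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            pair[(a, b)] += 1
            hist[a] += 1
    return lambda w1, w2: pair[(w1, w2)] / hist[w1] if hist[w1] else 0.0

p = train_bigram_model([["the", "cat"], ["the", "dog"]])
print(p("the", "cat"))  # 'the' continues to 'cat' in 1 of 2 cases
```

The resulting conditional distributions are exactly what gets compiled into the weighted FSA (in the log semiring) mentioned in the next snippet.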
Simulation 1 | Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model) by directly translating the bigram model into a wFSA in the log semiring. |
Simulation 2 | Instead, we begin with the same set of bigrams used in Sim. |
Simulation 2 | 1 — i.e., those that contain two in-vocabulary words — and trim this set by removing rare bigrams that occur less than 200 times in the BNC (except that we do not trim any bigrams that occur in our test corpus). |
Simulation 2 | This reduces our set of bigrams to about 19,000. |
System Architecture | To derive word features, first of all, our system automatically collects a list of word unigrams and bigrams from the training data. |
System Architecture | To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set. |
System Architecture | This list of word unigrams and bigrams is then used as a unigram-dictionary and a bigram-dictionary to generate word-based unigram and bigram features. |
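Dictionary-based feature firing of this kind can be sketched in a few lines. The feature-name strings are my own illustrative convention, not the system's:

```python
def dictionary_ngram_features(tokens, uni_dict, bi_dict):
    """Fire a feature for every word (or adjacent word pair) that appears
    in the unigram / bigram dictionary collected from training data."""
    feats = [f"UNI={w}" for w in tokens if w in uni_dict]
    feats += [f"BI={a}_{b}" for a, b in zip(tokens, tokens[1:])
              if (a, b) in bi_dict]
    return feats

uni = {"good", "movie"}
bi = {("good", "movie")}
print(dictionary_ngram_features("a good movie".split(), uni, bi))
# ['UNI=good', 'UNI=movie', 'BI=good_movie']
```

Restricting the dictionaries to n-grams seen more than twice in training, as the previous snippet describes, is what keeps this feature space from overfitting.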
Evaluation | To our surprise, the Fixed Affix model does a slightly better job in reducing out of vocabulary than the Bigram Affix model. |
Evaluation | (table excerpt) WoTr 24.21; Bigram Affix Model TRR 25. |
Morphology-based Vocabulary Expansion | We use two different models of morphology expansion in this paper: Fixed Affix model and Bigram Affix model. |
Morphology-based Vocabulary Expansion | 3.2.2 Bigram Affix Expansion Model |
Morphology-based Vocabulary Expansion | In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine. |
Experiments | Bigram based reordering. |
Experiments | First we consider a bigram language model and the algorithms try to find the reordering that maximizes the LM score. |
Experiments | This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM of the reference. |
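The observation above, that a permutation can out-score the reference under a bigram LM, is easy to verify by exhaustive search on short sentences. A toy sketch with a hand-made probability table (my own illustration, not the paper's experiment):

```python
from itertools import permutations
from math import log

def lm_score(tokens, logprob):
    """Sum of bigram log-probabilities over a padded token sequence."""
    toks = ["<s>"] + list(tokens) + ["</s>"]
    return sum(logprob(a, b) for a, b in zip(toks, toks[1:]))

def best_reordering(tokens, logprob):
    """Exhaustive search (fine for short sentences) for the permutation
    with the highest bigram LM score."""
    return max(permutations(tokens), key=lambda p: lm_score(p, logprob))

# Toy bigram table; unseen pairs get a small floor probability.
probs = {("<s>", "the"): 0.9, ("the", "cat"): 0.9,
         ("cat", "sat"): 0.9, ("sat", "</s>"): 0.9}
lp = lambda a, b: log(probs.get((a, b), 1e-4))
print(best_reordering(["sat", "cat", "the"], lp))  # → ('the', 'cat', 'sat')
```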
Phrase-based Decoding as TSP | • The language model cost of producing the target words of b′ right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b′. |
Phrase-based Decoding as TSP | This restriction to bigram models will be removed in Section 4.1. |
Phrase-based Decoding as TSP | 4.1 From Bigram to N-gram LM |
Approaches | The clusters are formed by a greedy hierarchical clustering algorithm that finds an assignment of words to classes by maximizing the likelihood of the training data under a latent-class bigram model. |
Approaches | First, for SRL, it has been observed that feature bigrams (the concatenation of simple features such as a predicate’s POS tag and an argument’s word) are important for state-of-the-art performance (Zhao et al., 2009; Bjorkelund et al., 2009). |
Approaches | We consider both template unigrams and bigrams, combining two templates in sequence. |
Experiments | Each of IG0 and IGB also includes 32 template bigrams selected by information gain on 1000 sentences; we select a different set of template bigrams for each dataset. |
Experiments | However, the original unigram Bjorkelund features, which were tuned for a high-resource model, obtain higher F1 than our information gain set using the same features in unigram and bigram templates (IGB). |
Experiments | In Czech, we disallowed template bigrams involving path-grams. |
Experiments | Consistent with findings in the literature (Cui et al., 2006; Dave et al., 2003; Gamon and Aue, 2005), on the large corpus of movie review texts, the in-domain-trained system based solely on unigrams had lower accuracy than the similar system trained on bigrams. |
Experiments | But the trigrams fared slightly worse than bigrams . |
Experiments | On sentences, however, we have observed an inverse pattern: unigrams performed better than bigrams and trigrams. |
Factors Affecting System Performance | System runs with unigrams, bigrams, and trigrams as features and with different training set sizes are presented. |
Integrating the Corpus-based and Dictionary-based Approaches | In the ensemble of classifiers, they used a combination of nine SVM-based classifiers deployed to learn unigrams, bigrams , and trigrams on three different domains, while the fourth domain was used as an evaluation set. |
Abstract | We consider bigram and trigram templates for generating potentially deterministic constraints. |
Abstract | A bigram constraint includes one contextual word (w−1 or w+1) or the corresponding morph feature; and a trigram constraint includes both contextual words or their morph features.
Abstract | Precision/recall/F1: bigram 0.993/0.841/0.911; trigram 0.996/0.608/0.755
Abstract | To illustrate, consider the following feature set, a bigram and a trigram (each term in the n-gram either has the form word or Atag): |
Abstract | please AVB and 9 is the number of bigrams in T, excluding sentence initial and final markers. |
Abstract | Unigrams and Bigrams: As a different sort of baseline, we considered the results of a bag-of-words based classifier.
Experiments | The baselines NB and BINB are Naive Bayes classifiers with, respectively, unigram features and unigram and bigram features. |
Experiments | SVM is a support vector machine with unigram and bigram features. |
Experiments | unigram, bigram, trigram 92.6 MAXENT POS, chunks, NE, supertags
Introduction | On the hand-labelled test set, the network achieves a greater than 25% reduction in the prediction error with respect to the strongest unigram and bigram baseline reported in Go et al. |
Automated Approaches to Deceptive Opinion Spam Detection | Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Automated Approaches to Deceptive Opinion Spam Detection | We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
Conclusion and Future Work | Specifically, our findings suggest the importance of considering both the context (e.g., BIGRAMS+) and motivations underlying a deception, rather than strictly adhering to a universal set of deception cues (e.g., LIWC).
Results and Discussion | This suggests that a universal set of keyword-based deception cues (e.g., LIWC) is not the best approach to detecting deception, and a context-sensitive approach (e.g., BIGRAMS+) might be necessary to achieve state-of-the-art deception detection performance.
Results and Discussion | Additional work is required, but these findings further suggest the importance of moving beyond a universal set of deceptive language features (e.g., LIWC) by considering both the contextual (e.g., BIGRAMS+) and motivational parameters underlying a deception as well.
Complexity Analysis | For the monolingual bigram model, the number of states in the HMM is U times more than that of the monolingual unigram model, as the states at specific position of F are not only related to the length of the current word, but also related to the length of the word before it. |
Complexity Analysis | NPY(bigram)a 0.750 0.802 17 m; NPY(trigram)a 0.757 0.807
Complexity Analysis | HDP(bigram)b 0.723 — 10 h; Fitnessc — 0.667 — —; Prop.
Related Work | We learn embeddings for unigrams, bigrams and trigrams separately with the same neural network and the same parameter settings.
Related Work | 25$ employs the embedding of unigrams, bigrams and trigrams separately and conducts the matrix-vector operation of ac on the sequence represented by columns in each lookup table. |
Related Work | L_uni, L_bi and L_tri are the lookup tables of the unigram, bigram and trigram embeddings, respectively.
Introduction | twitter unigram TTT * YES (54%); twitter bigram TTT * YES (52%); personal unigram MT * YES (52%); personal bigram — NO (48%)
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log p(w), where T refers to either all the unigrams (unigram model) or all and only bigrams (bigram model).16 We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
Introduction | [16] The tokens [at], [hashtag], [url] were ignored in the unigram-model case to prevent their undue influence, but retained in the bigram model to capture longer-range usage (“combination”) patterns.
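The tweet-scoring scheme described above (a tweet's mean log-probability under a community or personal language model) can be sketched as follows; the add-alpha smoothing and the `<s>` start symbol are assumptions of this sketch, not details given in the excerpt:

```python
import math
from collections import Counter

def train_bigram_lm(corpus, alpha=0.01):
    """Add-alpha smoothed bigram model from tokenized texts.
    (The paper's actual smoothing scheme is not specified here.)"""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    def prob(prev, word):
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
    return prob

def tweet_score(tokens, prob):
    """Mean log-probability (1/|T|) * sum of log p(w | w_prev) over the tweet."""
    toks = ["<s>"] + tokens
    logps = [math.log(prob(p, w)) for p, w in zip(toks, toks[1:])]
    return sum(logps) / len(logps)
```

A tweet that matches the training distribution scores closer to zero (higher) than one that does not.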
Experiments | Bigrams: an implementation of the bigram classifier for soft pattern matching proposed by Cui et al.
Experiments | The probability is calculated as a mixture of bigram and |
Experiments | WCL-1 99.88 42.09 59.22 76.06; WCL-3 98.81 60.74 75.23 83.48; Star patterns 86.74 66.14 75.05 81.84; Bigrams 66.70 82.70 73.84 75.80
Conditional Random Fields | In the sequel, we distinguish between two types of feature functions: unigram features f_{y,x}, associated with parameters μ_{y,x}, and bigram features f_{y′,y,x}, associated with parameters λ_{y′,y,x}.
Conditional Random Fields | On the other hand, bigram features {f_{y′,y,x}}_{(y′,y,x)∈Y²×X} are helpful in modelling dependencies between successive labels.
Conditional Random Fields | Assume the set of bigram features {λ_{y′,y,x_{t+1}}}_{(y′,y)∈Y²} is sparse with only r(x_{t+1}) ≪ |Y|² non-null values, and define the |Y| × |Y| sparse matrix
Experiments | We also kept NPs with only 1 modifier to be used for generating <modifier, head noun> bigram counts at training time.
Experiments | For example, the NP “the beautiful blue Macedonian vase” generates the following bigrams: <beautiful blue>, <blue Macedonian>, and <beautiful Macedonian>, along with the 3-gram <beautiful blue Macedonian>.
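Reading the example above as "every ordered pair of prenominal modifiers yields a bigram", a minimal sketch (the function name is hypothetical):

```python
from itertools import combinations

def modifier_ngrams(modifiers):
    """From an NP's prenominal modifiers (in surface order), generate all
    ordered modifier pairs as bigrams, plus the full modifier sequence.
    Illustrative reading of the worked example in the text."""
    bigrams = list(combinations(modifiers, 2))  # pairs keep surface order
    return bigrams, tuple(modifiers)
```

For "the beautiful blue Macedonian vase", this yields the three modifier bigrams and the 3-gram from the example.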
Experiments | In addition, we also store a table that keeps track of bigram counts for < M, H >, where H is the head noun of an NP and M is the modifier closest to it. |
Related Work | Shaw and Hatzivassiloglou also use a transitivity method to fill out parts of the Count table where bigrams are not actually seen in the training data but their counts can be inferred from other entries in the table, and they use a clustering method to group together modifiers with similar positional preferences. |
Related Work | Shaw and Hatzivassiloglou report a highest accuracy of 94.93% and a lowest accuracy of 65.93%, but since their methods depend heavily on bigram counts in the training corpus, they are also limited in how informed their decisions can be if modifiers in the test data are not present at training time. |
Experiments & Results 4.1 Experimental Setup | The measures are evaluated by fixing the window size to 4 and maximum candidate paraphrase length to 2 (e.g., bigram).
Experiments & Results 4.1 Experimental Setup | [Figure legend: unigram, bigram, trigram, quadgram]
Experiments & Results 4.1 Experimental Setup | Type/Node, MRR %, RCL %: Bipartite unigram 5.2, 12.5; Bipartite bigram 6.8, 15.7; Tripartite unigram 5.9, 12.6; Tripartite bigram 6.9, 15.9; Baseline bigram 3.9, 7.7
Approach | (a) Word unigrams (b) Word bigrams |
Approach | (a) PoS unigrams (b) PoS bigrams (c) PoS trigrams |
Approach | Word unigrams and bigrams are lower-cased and used in their inflected forms. |
Previous work | The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e.g., pass/fail, grades A–F) based on content and style features such as word unigrams and bigrams, sentence length, number of verbs, noun-verb pairs etc.
Validity tests | (a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence |
Methodology | In response to these difficulties in differentiating linguistic registers, we compute two different PMI scores for character-based bigrams from two large corpora representing news and microblogs as features. |
Methodology | In addition, we also convert all the character-based bigrams into Pinyin-based bigrams (ignoring tones5) and compute the Pinyin-level PMI in the same way. |
Methodology | These features capture inconsistent use of the bigram across the two domains, which helps to distinguish informal words.
Introduction | where we explicitly distinguish the unigram feature function φᵁ and bigram feature function φᴮ. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w_k φᵁ_k(x, y_n) and Σ_k w_k φᴮ_k(x, y_{n−1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n−1}).
Introduction | For bigram features, we compute its upper bound offline. |
Introduction | The simplest case is that the bigram features are independent of the token sequence :13. |
Evaluation framework | The second clique contains query bigrams that match |
Evaluation framework | document bigrams in 2-word ordered windows (‘#1’), λ2 = 0.1.
Evaluation framework | The third clique uses the same bigrams as clique 2 with an 8-word unordered window (‘#uw8’), λ3 = 0.05.
Introduction | Integration of the identified catenae in queries also improves IR effectiveness compared to a highly effective baseline that uses sequential bigrams with no linguistic knowledge. |
Change in Description Length | Suppose that the original sequence W_{i−1} is N words long, the selected word types x and y occur k and l times, respectively, and the x-y bigram occurs m times in W_{i−1}.
Change in Description Length | In the new sequence W_i, each of the m bigrams is replaced with an unseen word z = xy.
Regularized Compression | Hence, a new sequence W_i is created in the i-th iteration by merging all the occurrences of some selected bigram (x, y) in the original sequence W_{i−1}.
Regularized Compression | Note that f(x, y) is the bigram frequency, |W_{i−1}| the sequence length of W_{i−1}, and ΔH(W_{i−1}, x, y) = H(W_i) − H(W_{i−1}) is the difference between the empirical Shannon entropy measured on W_i and W_{i−1}, using maximum likelihood estimates.
Regularized Compression | In the new sequence W_i, each occurrence of the x-y bigram is replaced with a new (conceptually unseen) word z.
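The merge step described in this section (replacing every occurrence of a selected bigram in the previous sequence with an unseen word to form the new sequence) can be sketched as below; the greedy left-to-right scan is an assumption of the sketch:

```python
def merge_bigram(seq, x, y, z):
    """Create W_i from W_{i-1} by replacing every occurrence of the
    bigram (x, y) with the new word z (greedy left-to-right scan)."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == x and seq[i + 1] == y:
            out.append(z)   # the bigram collapses into one token
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out
```

Each merge shortens the sequence by one token per bigram occurrence, which is what drives the compression objective.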
Experimental Results | Moreover, a bigram (i.e., “2gram”) achieves the best BLEU scores among the four different orders of VMs. |
Experimental Results | This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model. |
Experimental Results | Indeed, although Table 3 shows that better approximations can be obtained by using higher-order models, the best BLEU score in Tables 2a and 2c was obtained by the bigram model. |
Variational Approximate Decoding | However, because q* only approximates p, y* of (13) may be locally appropriate but globally inadequate as a translation of x. Observe, e.g., that an n-gram model q* will tend to favor short strings y, regardless of the length of x. Suppose x = le chat chasse la souris (“the cat chases the mouse”) and q* is a bigram approximation to p(y | x). Presumably q*(the | START), q*(mouse | the), and q*(END | mouse) are all large in q*. So the most probable string y* under q* may be simply “the mouse,” which is short and has a high probability but fails to cover x.
Baseline parser | bigrams s0w s1w, s0w s1c, s0c s1w, s0c s1c, s0w q0w, s0w q0t, s0c q0w, s0c q0t, q0w q1w, q0w q1t, q0t q1w, q0t q1t, s1w q0w, s1w q0t, s1c q0w, s1c q0t
Semi-supervised Parsing with Large Data | [2] From the dependency trees, we extract bigram lexical dependencies (w1, w2, L/R) where the symbol L (R) means that w1 (w2) is the head of w2 (w1).
Semi-supervised Parsing with Large Data | (2009), we assign categories to bigram and trigram items separately according to their frequency counts. |
Semi-supervised Parsing with Large Data | Hereafter, we refer to the bigram and trigram lexical dependency lists as BLD and TLD, respectively. |
Experiments | Phrase extraction and indexing: We evaluate our proposed method on two different types of phrases: syntactic head-modifier pairs (syntactic phrases) and simple bigram phrases (statistical phrases). |
Experiments | Since our method is not limited to a particular type of phrases, we have also conducted experiments on statistical phrases (bigrams) with a reduced set of features directly applicable: RMO, RSO, PD5, DF, and CPP; the features requiring linguistic preprocessing (e.g., PPT) are not used, because it is unrealistic to use them under a bigram-based retrieval setting.
Experiments | [5] In most cases, the distance between words in a bigram is 1, but sometimes, it could be more than 1 because of the effect of stopword removal.
Experiments and Discussions | We use R-1 (recall against unigrams), R-2 (recall against bigrams), and R-SU4 (recall against skip-4 bigrams).
Experiments and Discussions | Note that R-2 is a measure of bigram recall and the sumHLDA of HybHSum2 is built on unigrams rather than bigrams.
Regression Model | We similarly include bigram features in the experiments. |
Regression Model | We also include bigram extensions of DMF features. |
Regression Model | We use sentence bigram frequency, sentence rank in a document, and sentence size as additional features.
Capturing Paradigmatic Relations via Word Clustering | The quality is defined based on a class-based bigram language model as follows. |
Capturing Paradigmatic Relations via Word Clustering | The objective function is maximizing the likelihood ∏_{i=1}^{n} P(w_i | w_1, ..., w_{i−1}) of the training data given a partially class-based bigram model of the form
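A class-based bigram model of the kind referenced above typically factors as P(w_i | w_{i−1}) = P(C(w_i) | C(w_{i−1})) · P(w_i | C(w_i)); a minimal unsmoothed sketch (all names and the count conventions are hypothetical):

```python
def class_bigram_prob(w_prev, w, cls, class_bigram, class_count, word_count):
    """Class-based bigram probability in the form
    P(w | w_prev) = P(C(w) | C(w_prev)) * P(w | C(w)).
    `cls` maps words to class ids; the counts are plain MLE token
    counts (a real system would smooth them)."""
    c_prev, c = cls[w_prev], cls[w]
    p_trans = class_bigram.get((c_prev, c), 0) / class_count[c_prev]  # class transition
    p_emit = word_count[w] / class_count[c]                           # word emission
    return p_trans * p_emit
```

The factorization is what makes the clustering objective tractable: only class-transition counts and per-class word counts are needed.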
State-of-the-Art | Word bigrams: w−2_w−1, w−1_w, w_w+1, w+1_w+2. In order to better handle unknown words, we extract morphological features: character n-gram prefixes and suffixes for n up to 3.
State-of-the-Art | (2009) introduced a bigram HMM model with latent variables (Bigram HMM-LA in the table) for Chinese tagging.
State-of-the-Art | Trigram HMM (Huang et al., 2009) 93.99%; Bigram HMM-LA (Huang et al., 2009) 94.53%; Our tagger 94.69%
Introduction | The extension of the feature representation used by previous works with bigrams and trigrams and an evaluation of the benefit of using longer keywords in hedge classification.
Methods | For trigrams, bigrams and unigrams (processed separately) we calculated a new class-conditional probability for each feature c, discarding those observations of c in speculative instances where c was not among the two highest ranked candidates.
Results | These keywords were many times used in a mathematical context (referring to probabilities) and thus expressed no speculative meaning, while such uses were not represented in the FlyBase articles (otherwise bigram or trigram features could have captured these non-speculative uses). |
Results | One third of the features used by our advanced model were either bigrams or trigrams. |
Results | Our model using just unigram features achieved a BEP(spec) score of 78.68% and Fβ=1(spec) score of 80.23%, which means that using bigram and trigram hedge cues here significantly improved the performance (the differences in BEP(spec) and Fβ=1(spec) scores were 5.23% and 4.97%, respectively).
A Class-based Model of Agreement | The features are indicators for (character, position, label) triples for a five-character window and bigram label transition indicators.
A Class-based Model of Agreement | Bigram transition features gbt encode local agreement relations. |
A Class-based Model of Agreement | We trained a simple add-1 smoothed bigram language model over gold class sequences in the same treebank training data: |
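An add-1 smoothed bigram model over class sequences, as mentioned above, can be sketched as follows (the `<s>` start symbol and the vocabulary convention are assumptions of this sketch):

```python
from collections import Counter

class AddOneBigramLM:
    """Add-1 (Laplace) smoothed bigram model, a minimal sketch of the
    kind of class-sequence model described in the excerpt."""
    def __init__(self, sequences):
        self.uni, self.bi = Counter(), Counter()
        for seq in sequences:
            seq = ["<s>"] + list(seq)
            self.uni.update(seq)
            self.bi.update(zip(seq, seq[1:]))
        self.V = len(self.uni)  # vocabulary size including <s>

    def prob(self, prev, cur):
        # add-1: (c(prev, cur) + 1) / (c(prev) + V)
        return (self.bi[(prev, cur)] + 1) / (self.uni[prev] + self.V)
```

Add-1 smoothing guarantees every transition a nonzero probability, which matters for short gold class sequences.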
MWE-dedicated Features | We use word unigrams and bigrams in order to capture multiwords present in the training section and to extract lexical cues to discover new MWEs. |
MWE-dedicated Features | For instance, the bigram coup de is often the prefix of compounds such as coup de pied (kick), coup de foudre (love at first sight), coup de main (help). |
MWE-dedicated Features | We use part-of-speech unigrams and bigrams in order to capture MWEs with irregular syntactic structures that might indicate the id-iomacity of a word sequence. |
Evaluation | The remaining 93 unlabeled games are used to train unigram, bigram, and trigram grounded language models.
Evaluation | Only unigrams, bigrams, and trigrams that are not proper names, appear more than three times, and are not composed only of stop words were used.
Evaluation | with traditional unigram, bigram, and trigram language models generated from a combination of the closed captioning transcripts of all training games and data from the switchboard corpus (see below).
Linguistic Mapping | Estimating bigram and trigram models can be done by collecting counts over word pairs or triples, and normalizing the resulting conditional distributions.
Approach | In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue. |
Approach | 0,; is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents. |
Approach | models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).
Experiment | Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010).7 To more closely match our models, LDA is given access to the same unigram and bigram tokens. |
Experiments | In the case of bigrams, the perplexities of Theory2 are almost the same as that of Zipf2 when the size of the reduced vocabulary is large.
Perplexity on Reduced Corpora | This model seems to be stupid, since we can easily notice that the bigram “is is” is quite frequent, and the two bigrams “is a” and “a is” have the same frequency. |
Perplexity on Reduced Corpora | For example, the decay function g2 of bigrams is as follows:
Perplexity on Reduced Corpora | They pointed out that the exponent of bigrams is about 0.66, and that of 5-grams is about 0.59 in the Wall Street Journal corpus (WSJ 87). |
Predicting Direction of Power | Baseline (Always Superior) 52.54; Baseline (Word Unigrams + Bigrams) 68.56; THRNew 55.90; THRPR 54.30; DIAPR 54.05; THRPR + THRNew 61.49; DIAPR + THRPR + THRNew 62.47; LEX 70.74; LEX + DIAPR + THRPR 67.44; LEX + DIAPR + THRPR + THRNew 68.56; BEST (= LEX + THRNew) 73.03; BEST (Using p1 features only) 72.08; BEST (Using IMt features only) 72.11; BEST (Using Mt only) 71.27; BEST (No Indicator Variables) 72.44
Predicting Direction of Power | We found the best setting to be using both unigrams and bigrams for all three types of ngrams, by tuning on our dev set.
Predicting Direction of Power | We also use a stronger baseline using word unigrams and bigrams as features, which obtained an accuracy of 68.6%. |
Conclusion | We have presented a noisy-channel model that simultaneously learns a lexicon, a bigram language model, and a model of phonetic variation, while using only the noisy surface forms as training data. |
Introduction | Previous models with similar goals have learned from an artificial corpus with a small vocabulary (Driesen et al., 2009; Rasanen, 2011) or have modeled variability only in vowels (Feldman et al., 2009); to our knowledge, this paper is the first to use a naturalistic infant-directed corpus while modeling variability in all segments, and to incorporate word-level context (a bigram language model). |
Introduction | Our model is conceptually similar to those used in speech recognition and other applications: we assume the intended tokens are generated from a bigram language model and then distorted by a noisy channel, in particular a log-linear model of phonetic variability. |
Related work | In contrast, our model uses a symbolic representation for sounds, but models variability in all segment types and incorporates a bigram word-level language model. |
Joint POS Tagging and Parsing with Nonlocal Features | (Collins and Koo, 2005) (Charniak and Johnson, 2005) Rules CoPar HeadTree Bigrams CoLenPar |
Joint POS Tagging and Parsing with Nonlocal Features | Grandparent Bigrams Heavy |
Joint POS Tagging and Parsing with Nonlocal Features | Lexical Bigrams Neighbours |
Count distributions | For example, suppose that our data consists of the following bigrams , with their weights: |
Word Alignment | That is, during the E step, we calculate the distribution of C(e, f) for each e and f, and during the M step, we train a language model on bigrams e f using expected KN smoothing (that is, with u = e and w = f). |
Word Alignment | (The latter case is equivalent to a backoff language model, where, since all bigrams are known, the lower-order model is never used.) |
Word Alignment | This is much less of a problem in KN smoothing, where p’ is estimated from bigram types rather than bigram tokens. |
Learning what to transliterate | From the stat section we collect statistics as to how often every word, bigram or trigram occurs, and what distribution of name/non-name patterns these ngrams have. |
Learning what to transliterate | The name distribution bigram |
Learning what to transliterate | stat corpus bitext, the first word is marked up as a non-name (“0”) and the second as a name (“1”), which strongly suggests that in such a bigram context, aljzyre better be translated as island or peninsula, and not be transliterated as Al-Jazeera.
Transliterator | The same consonant skeleton indexing process is applied to name bigrams (47,700,548 unique with 167,398,054 skeletons) and trigrams (46,543,712 unique with 165,536,451 skeletons). |
Analysis | with a bigram HMM with four language clusters. |
Inference | where n(t) and n(t, t′) are, respectively, unigram and bigram tag counts excluding those containing character w. Conversely, n′(t) and n′(t, t′) are, respectively, unigram and bigram tag counts only including those containing character w. The notation a^{(n)} denotes the ascending factorial: a(a + 1) ⋯ (a + n − 1).
Inference | where n(j, k, t) and n(j, k, t, t′) are the numbers of languages currently assigned to cluster k which have more than j occurrences of unigram (t) and bigram (t, t′), respectively.
Model | We note that in practice, we implemented a trigram version of the model,2 but we present the bigram version here for notational clarity. |
Reranking Features | Bigram.
Reranking Features | Indicates whether a given bigram of nonterminal/terminals occurs for a given parent nonterminal: f(L1 → L2 : L3) = 1.
Reranking Features | Grandparent Bigram.
Approach | Bigrams (B) - the text is represented as a bag of bigrams (larger n-grams did not help). |
Approach | Generalized bigrams (Bg) - same as above, but the words are generalized to their WNSS. |
Experiments | The first chosen feature is the translation probability computed between the Bg question and answer representations (bigrams with words generalized to their WNSS tags).
Experiments | This is caused by the fact that the BM25 formula is less forgiving with errors of the NLP processors (due to the high idf scores assigned to bigrams and dependencies), and the WNSS tagger is the least robust component in our pipeline. |
Computing Feature Expectations | Hyper-edges (boxes) are annotated with normalized transition probabilities, as well as the bigrams produced by each rule application. |
Computing Feature Expectations | The expected count of the bigram “man with” is the sum of posterior probabilities of the two hyper-edges that produce it. |
Computing Feature Expectations | 5For example, decoding under a variational approximation to the model’s posterior that decomposes over bigram probabilities is equivalent to fast consensus decoding with |
Discussion | In the figure, phone bigram TF-IDF is labeled p2; phonetic alignment with dynamic programming is labeled DP.
Experiments | The TF-IDF features used in the experiments are based on phone bigrams.
Feature functions | In practice, we only consider n-grams of a certain order (e.g., bigrams).
Feature functions | Then for the bigram /l iy/, we have TF_{/l iy/} = 1/5 (one out of five bigrams in the word), and IDF_{/l iy/} = log(2/1) (one word out of two in the dictionary).
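The worked example above (TF = 1/5 for one occurrence among five bigrams, IDF = log(2/1) for one dictionary word out of two) corresponds to a sketch like this; the function names and exact TF/IDF variants are assumptions:

```python
import math

def phone_bigrams(pron):
    """Adjacent phone pairs in a pronunciation (a list of phone strings)."""
    return list(zip(pron, pron[1:]))

def tf_idf(bigram, pron, dictionary):
    """TF-IDF for a phone bigram: TF is the bigram's relative frequency
    in the word's pronunciation; IDF = log(N / df) over dictionary entries."""
    bgs = phone_bigrams(pron)
    tf = bgs.count(bigram) / len(bgs)
    df = sum(1 for p in dictionary if bigram in phone_bigrams(p))
    idf = math.log(len(dictionary) / df)
    return tf * idf
```

A pronunciation with five bigrams, one of which is the target, in a two-word dictionary where only that word contains it, reproduces the (1/5) · log(2) value from the example.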
Clustering-based word representations | The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992). |
Clustering-based word representations | So it is a class-based bigram language model. |
Clustering-based word representations | One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context. |
Experiments | Web page hits for word pairs and trigrams are obtained using a simple heuristic query to the search engine Google.11 Inflected queries are performed by expanding a bigram or trigram into all its morphological forms. |
Experiments | Although Google hits is noisier, it has very much larger coverage of bigrams or trigrams. |
Experiments | This means that if pages indexed by Google doubles, then so do the bigrams or trigrams frequencies. |
Related Work | Keller and Lapata (2003) evaluated the utility of using web search engine statistics for unseen bigrams.
Experiments | All of the three smoothing methods for bigram and trigram LMs are examined both using back-off models
Pinyin Input Method Model | The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{j,i}) that a syllable S_{j,i} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
Pinyin Input Method Model | W_E(V_{j,i} → V_{j+1,k}) = −log P(V_{j+1,k} | V_{j,i}). Although the model is formulated on a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher-order n-gram LM, by tracking a longer history while traversing the graph.
Syllabification with Structured SVMs | For example, the bigram bl frequently occurs within a single English syllable, while the bigram lb generally straddles two syllables. |
Syllabification with Structured SVMs | Thus, in addition to the single-letter features outlined above, we also include in our representation any bigrams, trigrams, four-grams, and five-grams that fit inside our context window.
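The feature extraction described above (all n-grams up to length five that fit inside the context window) can be sketched as below; the window-centering convention is an assumption:

```python
def window_ngram_features(word, i, size=5, max_n=5):
    """All letter n-grams (n = 1..max_n) that fit inside a context
    window of `size` letters centered on position i."""
    lo = max(0, i - size // 2)
    hi = min(len(word), i + size // 2 + 1)
    window = word[lo:hi]
    feats = []
    for n in range(1, max_n + 1):
        for j in range(len(window) - n + 1):
            feats.append(window[j:j + n])   # substring of length n
    return feats
```

For example, a 3-letter window yields the single letters plus the two bigrams it contains.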
Syllabification with Structured SVMs | As is apparent from Figure 2, we see a substantial improvement by adding bigrams to our feature set. |
Experiments | Local instances: Rule 10,851; Word 20,328; WordEdges 454,101; CoLenPar 22; Bigram<> 10,292; Trigram<> 24,677; HeadMod<> 12,047; DistMod<> 16,017. Nonlocal instances: ParentRule 18,019; WProj 27,417; Heads 70,013; HeadTree 67,836; Heavy 1,401; NGramTree 67,559; RightBranch 2. Total feature instances: 800,582.
Experiments | We also restricted NGramTree to be on bigrams only. |
Forest Reranking | Figure 2(d) returns the minimum tree fragment spanning a bigram, in this case “saw” and “the”, and should thus be computed at the smallest common ancestor of the two, which is the VP node in this example.
Conclusion and future work | We then investigated adaptor grammars that incorporate one additional kind of information, and found that modeling collocations provides the greatest improvement in word segmentation accuracy, resulting in a model that seems to capture many of the same interword dependencies as the bigram model of Goldwater et al. |
Word segmentation with adaptor grammars | It is not possible to write an adaptor grammar that directly implements Goldwater’s bigram word segmentation model because an adaptor grammar has one DP per adapted nonterminal (so the number of DPs is fixed in advance) while Goldwater’s bigram model has one DP per word type, and the number of word types is not known in advance. |
Word segmentation with adaptor grammars | This suggests that the collocation word adaptor grammar can capture inter-word dependencies similar to those that improve the performance of Goldwater’s bigram segmentation model. |
Background 2.1 Dependency parsing | The algorithm then repeatedly merges the pair of clusters which causes the smallest decrease in the likelihood of the text corpus, according to a class-based bigram language model defined on the word clusters. |
Conclusions | To begin, recall that the Brown clustering algorithm is based on a bigram language model. |
Feature design | (2005a), and consists of indicator functions for combinations of words and parts of speech for the head and modifier of each dependency, as well as certain contextual tokens.1 Our second-order baseline features are the same as those of Carreras (2007) and include indicators for triples of part-of-speech tags for sibling interactions and grandparent interactions, as well as additional bigram features based on pairs of words involved in these higher-order interactions.
Experiments | Excluding q_{t−1} = q_t bigrams (leading to 0.32M frames from 2.39M frames in “all”) offers a glimpse of expected performance differences were duration modeling to be included in the models.
Limitations and Desiderata | To produce Figures 1 and 2, a small fraction of probability mass was reserved for unseen bigram transitions (as opposed to backing off to unigram probabilities). |
The Extended-Degree-of-Overlap Model | The EDO model mitigates R-specificity because it models each bigram (q_{t−1}, q_t) = (S_i, S_j) as the modified bigram (m, [o_{ij}, n_j]), involving three scalars, each of which is a sum, a commutative (and therefore rotation-invariant) operation.
Automatic Metaphor Recognition | They use the hyponymy relation in WordNet and word bigram counts to predict metaphors at a sentence level.
Automatic Metaphor Recognition | Hereby they calculate bigram probabilities of verb-noun and adjective-noun pairs (including the hyponyms/hypernyms of the noun in question). |
Automatic Metaphor Recognition | However, by using bigram counts over verb-noun pairs, Krishnakumaran and Zhu (2007) lose a great deal of information compared to a system extracting verb-object relations from parsed text.
Models 2.1 Baseline Models | After CRF based recovery of the suffix tag sequence, we use a bigram language model trained on a full segmented version on the training data to recover the original vowels. |
Models 2.1 Baseline Models | We used bigrams only, because the suffix vowel harmony alternation depends only upon the preceding phonemes in the word from which it was segmented. |
Models 2.1 Baseline Models | original training data: koskevaa mietintöä käsitellään; segmentation: koske+ +va+ +a mietintö+ +ä käsi+ +te+ +llä+ +ä+ +n (train bigram language model with mapping A = {a, ä}); map final suffix to abstract tag-set: koske+ +va+ +A mietintö+ +A käsi+ +te+ +llä+ +ä+ +n (train CRF model to predict the final suffix); peeling off final suffix: koske+ +va+ mietintö+ käsi+ +te+ +llä+ +ä+ (train SMT model on this transformation of training data) (a) Training
Experiments | In particular, we use the unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for POS, CoNLL2000 and CoNLL 2003 datasets. |
Experiments | For supertag dataset, we use the same features for the word inputs, and the unigrams and bigrams for gold POS inputs. |
Problem formulation | Bigram features are of the form f_k(y_t, y_{t−1}, x_t), which are concerned with both the previous and the current labels.
Experiments | Besides unigram and bigram, the most effective textual feature is URL.
Proposed Features | 3.1.1 Unigrams and Bigrams: The most common type of feature for text classification
Proposed Features | feature selection method χ² (Yang and Pedersen, 1997) to select the top 200 unigrams and bigrams as features.
Datasets | Their features come from the Linguistic Inquiry and Word Count lexicon (LIWC) (Pennebaker et al., 2001), as well as from lists of "sticky bigrams" (Brown et al., 1992) strongly associated with one party or another (e.g., "illegal aliens" implies conservative, "universal healthcare" implies liberal).
Datasets | We first extract the subset of sentences that contain any words in the LIWC categories of Negative Emotion, Positive Emotion, Causation, Anger, and Kill verbs. After computing a list of the top 100 sticky bigrams for each category, ranked by log-likelihood ratio, and selecting another subset from the original data that includes only sentences containing at least one sticky bigram, we take the union of the two subsets.
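Ranking bigrams by log-likelihood ratio is commonly done with Dunning's G² statistic over a 2×2 contingency table; a minimal sketch (the `llr` helper and its count conventions are illustrative, not the paper's code):

```python
import math

def llr(c12, c1, c2, n):
    # Dunning's G^2 log-likelihood ratio for a bigram (w1, w2):
    # c12 = count(w1 w2), c1 = count(w1 *), c2 = count(* w2),
    # n = total number of bigram tokens.
    obs = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    rows, cols = [c1, n - c1], [c2, n - c2]
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = obs[2 * i + j]
            e = rows[i] * cols[j] / n  # expected count under independence
            if o > 0:
                g += o * math.log(o / e)
    return 2.0 * g
```

When the observed counts match the independence assumption the score is zero; strongly associated ("sticky") word pairs get large scores.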
Related Work | They use an HMM-based model, defining the states as a set of fine-grained political ideologies, and rely on a closed set of lexical bigram features associated with each ideology, inferred from a manually labeled ideological books corpus. |
Experimental Design | Consecutive Word/Bigram/Trigram This feature family targets adjacent repetitions of the same word, bigram or trigram, e.g., 'show me the show me the'.
Problem Formulation | The weight of this rule is the bigram probability of two records conditioned on their type, multiplied with a normalization factor λ.
Problem Formulation | Rule (6) defines the expansion of field F to a sequence of (binarized) words W, with a weight equal to the bigram probability of the current word given the previous word, the current record, and field. |
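As a rough sketch of such a weight, an unsmoothed MLE bigram probability can be read off raw counts; `bigram_prob` and the toy count table are hypothetical, standing in for the conditional distributions estimated from the corpus:

```python
def bigram_prob(counts, prev, curr):
    # Unsmoothed MLE estimate of p(curr | prev) from raw bigram counts.
    # `counts` maps (prev, curr) pairs to frequencies (illustrative data).
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts.get((prev, curr), 0) / total if total else 0.0

counts = {("a", "b"): 3, ("a", "c"): 1}
```

In the grammar above the same idea is applied at two levels: bigrams over record types, and bigrams over words conditioned additionally on the current record and field.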
Structure-based Stacking | • Character unigrams: c_k (i − l ≤ k ≤ i + l)
Structure-based Stacking | • Character bigrams: c_k c_{k+1} (i − l ≤ k < i + l)
Structure-based Stacking | • Character label bigrams: c_k^{ppd} c_{k+1}^{ppd} (i − l_ppd ≤ k < i + l_ppd)
Structure-based Stacking | • Bigram features: C(s_k)C(s_{k+1}) (i − l_C ≤ k < i + l_C), T_ctb(s_k)T_ctb(s_{k+1}) (i − l_ctb ≤ k < i + l_ctb), T_ppd(s_k)T_ppd(s_{k+1}) (i − l_ppd ≤ k < i + l_ppd)
Compressive Summarization | (2011), we used stemmed word bigrams as concepts, to which we associate the following concept features (Φ_cov): indicators for document counts, features indicating whether each of the words in the bigram is a stop-word, the earliest position in a document at which each concept occurs, as well as two- and three-way conjunctions of these features.
Experiments | We generated oracle extracts by maximizing bigram recall with respect to the manual abstracts, as described in Berg-Kirkpatrick et al. |
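The exact oracle is typically solved as an ILP, but a greedy approximation conveys the idea; the sketch below (with hypothetical `greedy_oracle` and toy data) adds, at each step, the sentence that covers the most not-yet-covered reference bigrams:

```python
def bigram_set(tokens):
    return set(zip(tokens, tokens[1:]))

def greedy_oracle(sentences, abstract, max_sents=2):
    # Greedily add the sentence with the largest gain in newly covered
    # reference bigrams; stop when no sentence adds coverage.
    ref = bigram_set(abstract.split())
    chosen, covered = [], set()
    for _ in range(max_sents):
        best, gain = None, 0
        for s in sentences:
            if s in chosen:
                continue
            g = len((bigram_set(s.split()) & ref) - covered)
            if g > gain:
                best, gain = s, g
        if best is None:
            break
        chosen.append(best)
        covered |= bigram_set(best.split()) & ref
    return chosen

sents = ["the cat sat on the mat", "dogs bark loudly", "the mat was red"]
```

Unlike the ILP, the greedy version is not guaranteed to maximize bigram recall, but it is a common and fast baseline for building oracle extracts.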
Extractive Summarization | Previous work has modeled concepts as events (Filatova and Hatzivassiloglou, 2004), salient words (Lin and Bilmes, 2010), and word bigrams (Gillick et al., 2008).
Using subcategorization information | The model has access to the basic features stem and tag, as well as the new features based on subcategorization information (explained below), using unigrams within a window of up to four positions to the right and the left of the current position, as well as bigrams and trigrams for stems and tags (current item + left and/or right item).
Using subcategorization information | In addition to the probability/frequency of the respective functions, we also provide the CRF with bigrams containing the two parts of the tuple, |
Using subcategorization information | By providing the parts of the tuple as unigrams, bigrams or trigrams to the CRF, all relevant information is available: verb, noun and the probabilities for the potential functions of the noun in the sentence. |
Introduction | Knowing that the back-transliterated unigram "blacki" and bigram "blacki shred" are unlikely in English can promote the correct WS for the source word meaning "blackish red".
Use of Language Model | As the English LM, we used Google Web 1T 5-gram Version 1 (Brants and Franz, 2006), limiting it to unigrams occurring more than 2000 times and bigrams occurring more than 500 times. |
Word Segmentation Model | We limit the features to word unigram and bigram features, i.e., φ(y) = Σ_{i=1}^{n} [φ_1(w_i) + φ_2(w_{i−1}, w_i)] for y = w_1 ... w_n.
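A minimal sketch of such a unigram-plus-bigram scoring function, with hypothetical feature-weight dictionaries standing in for the learned parameters:

```python
def score_segmentation(words, uni_w, bi_w):
    # phi(y) = sum_i [phi_1(w_i) + phi_2(w_{i-1}, w_i)], with feature
    # weights looked up in dictionaries (0.0 for unseen features).
    score = sum(uni_w.get(w, 0.0) for w in words)
    score += sum(bi_w.get(p, 0.0) for p in zip(words, words[1:]))
    return score

# Hypothetical weights: the plausible segmentation outscores the bad one.
uni_w = {"black": 1.0, "red": 1.0, "blacki": -0.5}
bi_w = {("black", "red"): 0.5}
```

Decoding then searches over candidate segmentations y for the one maximizing this score.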
Results | The results in Table 5 use the official ROUGE software with standard options and report ROUGE-2 (R-2) (measures bigram overlap) and ROUGE-SU4 (R-SU4) (measures unigram and skip-bigram separated by up to four words).
The Framework | unigram/bigram/skip-bigram (at most four words apart) overlap; unigram/bigram TF/TF-IDF similarity
The Framework | 2011; Ouyang et al., 2011), we use the ROUGE-2 score, which measures bigram overlap between a sentence and the abstracts, as the objective for regression. |
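The core of ROUGE-2 is clipped bigram recall; a simplified sketch (the official toolkit additionally handles stemming, stopword options, and multiple references):

```python
from collections import Counter

def rouge2_recall(candidate, reference):
    # Clipped bigram recall: fraction of reference bigrams (with
    # multiplicity) that also appear in the candidate.
    def bigram_counts(s):
        t = s.lower().split()
        return Counter(zip(t, t[1:]))
    cand, ref = bigram_counts(candidate), bigram_counts(reference)
    overlap = sum(min(c, ref[b]) for b, c in cand.items() if b in ref)
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

Using this per-sentence score as a regression target lets a model learn to rank sentences by how much abstract-like bigram content they carry.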
Introduction | For example, if a bigram parameter is modified due to the presence of some set of trigrams, and then some or all of those trigrams are pruned from the model, the bigram associated with the modified parameter will be unlikely to have an overall expected frequency equal to its observed frequency anymore. |
Marginal distribution constraints | Thus the unigram distribution is with respect to the bigram model, the bigram model is with respect to the trigram model, and so forth. |
Model constraint algorithm | This can be seen particularly clearly at the unigram state, which has an arc for every unigram (on the order of the vocabulary size): for every bigram state (also on the order of the vocabulary size), the naive algorithm must examine every possible arc.
Phrase Ranking based on Relevance | This thread of research models bigrams by encoding them into the generative process. |
Phrase Ranking based on Relevance | For each word, a topic is sampled first, then its status as a unigram or bigram is sampled, and finally the word is sampled from a topic-specific unigram or bigram distribution. |
Phrase Ranking based on Relevance | In (Tomokiyo and Hurst, 2003), a language model approach is used for bigram phrase extraction. |
Task A: Polarity Classification | We studied the influence of unigrams, bigrams and a combination of the two, and saw that the best performing feature set consists of the combination of unigrams and bigrams.
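A minimal sketch of building such a combined unigram-plus-bigram feature bag (the `ngram_features` helper is hypothetical):

```python
from collections import Counter

def ngram_features(text):
    # Bag of unigrams plus bigrams: the combination reported as the
    # best performing feature set in the experiments described above.
    toks = text.lower().split()
    feats = Counter(toks)
    feats.update(" ".join(g) for g in zip(toks, toks[1:]))
    return feats
```

The resulting counts can be fed directly to any standard linear classifier.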
Task A: Polarity Classification | In this paper, we will refer from now on to n-grams as the combination of unigrams and bigrams . |
Task B: Valence Prediction | Those include n-grams (unigrams, bigrams and the combination of the two) and LIWC scores.