Distribution Prediction | For this purpose, we represent a word w using unigrams and bigrams that co-occur with w in a sentence as follows.
Distribution Prediction | Next, we generate bigrams of word lemmas and remove any bigrams that consist only of stop words.
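The filtering step above can be sketched as follows (a minimal illustration; the stop-word list and the `lemma_bigrams` helper are hypothetical, and lemmatization is assumed to have been done upstream):

```python
# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "not", "and"}

def lemma_bigrams(lemmas):
    """Generate bigrams of consecutive word lemmas, removing any
    bigram whose two members are both stop words."""
    pairs = zip(lemmas, lemmas[1:])
    return [(x, y) for x, y in pairs
            if not (x in STOP_WORDS and y in STOP_WORDS)]

# ("is", "not") is dropped because both lemmas are stop words.
print(lemma_bigrams(["the", "movie", "is", "not", "good"]))
# → [('the', 'movie'), ('movie', 'is'), ('not', 'good')]
```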
Distribution Prediction | Bigram features capture negations more accurately than unigrams, and have been found to be useful for sentiment classification tasks. |
Models for Measuring Grammatical Competence | Then, regarding POS bigrams as terms, they construct POS-based vector space models for each score-class (there are four score classes denoting levels of proficiency as will be explained in Section 5.2), thus yielding four score-specific vector-space models (VSMs). |
Models for Measuring Grammatical Competence | 0.0034: the cosine similarity score between the test response and the vector of POS bigrams for the highest score class (level 4); and,
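The per-class similarity scoring described above can be sketched as follows (a minimal illustration with sparse count vectors; the tag sequences and the helper functions are assumptions, not the actual VSM construction):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pos_bigram_vector(tags):
    """Count vector of POS-tag bigrams for one tag sequence."""
    return Counter(zip(tags, tags[1:]))

# Hypothetical score-class model built by pooling level-4 responses,
# compared against one test response.
level4_model = pos_bigram_vector(["DT", "NN", "VBZ", "DT", "JJ", "NN"])
response = pos_bigram_vector(["DT", "NN", "VBZ", "JJ"])
sim = cosine(response, level4_model)
```

In the full setup, one such vector would be built per score class and the response scored against each of the four models.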
Models for Measuring Grammatical Competence | First, the VSM-based method is likely to overestimate the contribution of the POS bigrams when highly correlated bigrams occur as terms in the VSM. |
Related Work | In order to avoid the problems encountered with deep analysis-based measures, Yoon and Bhat (2012) explored a shallow analysis-based approach, based on the assumption that the level of grammar sophistication at each proficiency level is reflected in the distribution of part-of-speech (POS) tag bigrams.
Shallow-analysis approach to measuring syntactic complexity | The measures of syntactic complexity in this approach are POS bigrams and are not obtained by a deep analysis (syntactic parsing) of the structure of the sentence. |
Shallow-analysis approach to measuring syntactic complexity | In a shallow-analysis approach to measuring syntactic complexity, we rely on the distribution of POS bigrams at every proficiency level.
Shallow-analysis approach to measuring syntactic complexity | Consider the two sentence fragments below taken from actual responses (the bigrams of interest and their associated POS tags are boldfaced). |
Experiment | Comparison on the PKU and MSRA datasets:
  Model                        PKU   MSRA
  Best05 (Chen et al., 2005)   95.0  96.0
  Best05 (Tseng et al., 2005)  95.0  96.4
  (Zhang et al., 2006)         95.1  97.1
  (Zhang and Clark, 2007)      94.5  97.2
  (Sun et al., 2009)           95.2  97.3
  (Sun et al., 2012)           95.4  97.4
  (Zhang et al., 2013)         96.1  97.4
  MMTNN                        94.0  94.9
  MMTNN + bigram               95.2  97.2
Experiment | A very common feature in Chinese word segmentation is the character bigram feature. |
Experiment | Formally, at the i-th character of a sentence c[1:n], the bigram features are c_k c_{k+1} (i-3 < k < i+2).
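Under this definition the feature window can be extracted as in the following sketch (the boundary-padding symbols are an assumption; the excerpt does not specify how out-of-range positions are handled):

```python
def char_bigram_features(chars, i):
    """Character bigram features c_k c_{k+1} for i-3 < k < i+2,
    i.e. the four bigrams covering a window around position i.
    Out-of-range positions are padded with boundary symbols
    (the symbols themselves are an assumption)."""
    padded = ["<B>"] * 2 + list(chars) + ["<E>"] * 2
    j = i + 2  # index of character i in the padded sequence
    return [padded[k] + padded[k + 1] for k in range(j - 2, j + 2)]

print(char_bigram_features("abc", 1))  # → ['<B>a', 'ab', 'bc', 'c<E>']
```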
Introduction | Therefore, we integrate additional simple character bigram features into our model and the result shows that our model can achieve a competitive performance that other systems hardly achieve unless they use more complex task-specific features. |
Related Work | Most previous systems address this task by using linear statistical models with carefully designed features such as bigram features, punctuation information (Li and Sun, 2009) and statistical information (Sun and Xu, 2011). |
Multi-Structure Sentence Compression | Following this, §2.3 discusses a dynamic program to find maximum weight bigram subsequences from the input sentence, while §2.4 covers LP relaxation-based approaches for approximating solutions to the problem of finding a maximum-weight subtree in a graph of potential output dependencies. |
Multi-Structure Sentence Compression | In addition, we define bigram indicator variables y_ij ∈ {0, 1} to represent whether a particular order-preserving bigram ⟨t_i, t_j⟩ from S is present as a contiguous bigram in C, as well as dependency indicator variables z_ij ∈ {0, 1} corresponding to whether the dependency arc t_i → t_j is present in the dependency parse of C. The score for a given compression C can now be defined to factor over its tokens, n-grams and dependencies as follows.
Multi-Structure Sentence Compression | where θ_tok, θ_ngr and θ_dep are feature-based scoring functions for tokens, bigrams and dependencies respectively.
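The factorization can be sketched schematically (the constant scorers in the example are placeholders for the learned feature-based functions; the function name is hypothetical):

```python
def compression_score(tokens, bigrams, deps, theta_tok, theta_ngr, theta_dep):
    """Score of a candidate compression, factored as a sum of
    feature-based scores over its tokens, its contiguous bigrams,
    and its dependency arcs."""
    return (sum(theta_tok(t) for t in tokens)
            + sum(theta_ngr(b) for b in bigrams)
            + sum(theta_dep(d) for d in deps))

# With placeholder scorers that return 1.0, the score is simply the
# total count of tokens, bigrams and dependency arcs: 3 + 2 + 1 = 6.0.
one = lambda x: 1.0
print(compression_score(["a", "b", "c"], [("a", "b"), ("b", "c")],
                        [("b", "a")], one, one, one))  # → 6.0
```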
Evaluation | In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq. |
Evaluation | Table 4 presents the results of these variations; overall, by taking into account generated candidates appropriately and using bigrams (“SLP 2-gram”), we obtained a 1.13 BLEU gain on the test set. |
Evaluation | Using unigrams (“SLP 1-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams.
Generation & Propagation | Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings. |
Generation & Propagation | We only consider target phrases whose source phrase is a bigram, but it is worth noting that the target phrases are of variable length.
Generation & Propagation | To generate new translation candidates using the baseline system, we decode each unlabeled source bigram to generate its m-best translations. |
Approaches | The clusters are formed by a greedy hierarchical clustering algorithm that finds an assignment of words to classes by maximizing the likelihood of the training data under a latent-class bigram model.
Approaches | First, for SRL, it has been observed that feature bigrams (the concatenation of simple features such as a predicate’s POS tag and an argument’s word) are important for state-of-the-art performance (Zhao et al., 2009; Bjorkelund et al., 2009).
Approaches | We consider both template unigrams and bigrams, combining two templates in sequence.
Experiments | Each of IGO and IGB also includes 32 template bigrams selected by information gain on 1000 sentences—we select a different set of template bigrams for each dataset.
Experiments | However, the original unigram Bjorkelund features, which were tuned for a high-resource model, obtain higher F1 than our information gain set using the same features in unigram and bigram templates (IGB).
Experiments | In Czech, we disallowed template bigrams involving path-grams. |
Evaluation | To our surprise, the Fixed Affix model does a slightly better job in reducing out of vocabulary than the Bigram Affix model. |
Evaluation | [Truncated results excerpt: WoTr 24.21; Bigram Affix Model, TRR 25.—]
Morphology-based Vocabulary Expansion | We use two different models of morphology expansion in this paper: the Fixed Affix model and the Bigram Affix model.
Morphology-based Vocabulary Expansion | 3.2.2 Bigram Affix Expansion Model |
Morphology-based Vocabulary Expansion | In the Bigram Affix model, we do the same for the stem as in the Fixed Affix model, but for prefixes and suffixes, we create a bigram language model in the finite state machine. |
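The affix bigram model can be sketched as a maximum-likelihood bigram distribution over affix sequences, which would then be compiled into the arcs of the finite-state machine (a hypothetical sketch; the actual FSM construction and any smoothing are not shown in the excerpt, and the function name is an assumption):

```python
from collections import Counter

def train_affix_bigram_lm(suffix_sequences):
    """MLE bigram model over affix sequences. Each input sequence is
    the ordered list of suffixes observed on one word; <s> and </s>
    mark the start and end of the suffix string."""
    counts = Counter()   # bigram counts
    context = Counter()  # unigram (history) counts
    for seq in suffix_sequences:
        seq = ["<s>"] + list(seq) + ["</s>"]
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return {ab: c / context[ab[0]] for ab, c in counts.items()}

# Tiny example: after "ler", the suffix "i" follows half the time.
lm = train_affix_bigram_lm([["ler", "i"], ["ler"]])
```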
Experiments | The baselines NB and BINB are Naive Bayes classifiers with, respectively, unigram features and unigram and bigram features. |
Experiments | SVM is a support vector machine with unigram and bigram features. |
Experiments | [Table excerpt] MAXENT (unigram, bigram, trigram): 92.6; MAXENT (POS, chunks, NE, supertags): —
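The unigram-plus-bigram feature extraction used by these baselines can be sketched as follows (a minimal illustration; the function name is hypothetical, and real systems typically use counts or tf-idf weighting rather than a flat feature list):

```python
def ngram_features(tokens):
    """Bag of unigram and bigram features for a classifier such as
    Naive Bayes or an SVM; bigrams are joined with an underscore."""
    unigrams = list(tokens)
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams

print(ngram_features(["not", "good"]))  # → ['not', 'good', 'not_good']
```

Bigram features like `not_good` are what let such classifiers capture negation patterns that unigrams alone miss.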
Introduction | On the hand-labelled test set, the network achieves a greater than 25% reduction in the prediction error with respect to the strongest unigram and bigram baseline reported in Go et al. |
Complexity Analysis | For the monolingual bigram model, the number of states in the HMM is U times more than that of the monolingual unigram model, as the states at a specific position of F are not only related to the length of the current word, but also related to the length of the word before it.
Complexity Analysis | [Results excerpt]
  NPY(bigram)   0.750  0.802  17 m
  NPY(trigram)  0.757  0.807  —
  HDP(bigram)   0.723  —      10 h
  Fitness       —      0.667  —
  Prop. …
Related Work | We learn embeddings for unigrams, bigrams and trigrams separately with the same neural network and the same parameter settings.
Related Work | It employs the embeddings of unigrams, bigrams and trigrams separately and conducts the matrix-vector operation on the sequence represented by columns in each lookup table.
Related Work | L_uni, L_bi and L_tri are the lookup tables of the unigram, bigram and trigram embeddings, respectively.
Introduction | twitter unigram: TTT * YES (54%); twitter bigram: TTT * YES (52%); personal unigram: MT * YES (52%); personal bigram: — NO (48%)
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log p(w), where T refers to either all the unigrams (unigram model) or all and only the bigrams (bigram model).16 We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
Introduction | 16The tokens [at], [hashtag], [url] were ignored in the unigram-model case to prevent their undue influence, but retained in the bigram model to capture longer-range usage (“combination”) patterns. |
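The average-log-probability score can be sketched as follows (a minimal unigram-model version; the floor probability for unseen tokens is an assumption, since the excerpt does not specify smoothing):

```python
import math

def lm_score(tokens, unigram_probs, floor=1e-8):
    """Average log-probability of a tweet under a unigram language
    model, (1/|T|) * sum over w in T of log p(w). Unseen tokens get
    a small floor probability (an assumption made here)."""
    return sum(math.log(unigram_probs.get(w, floor))
               for w in tokens) / len(tokens)

# A tweet made entirely of high-probability words scores higher
# (closer to zero) than one containing unexpected words.
model = {"good": 0.5, "morning": 0.5}
print(lm_score(["good", "morning"], model) >
      lm_score(["good", "xyzzy"], model))  # → True
```

The bigram-model variant is identical except that T ranges over the tweet's bigrams and the model maps bigrams to probabilities.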
Count distributions | For example, suppose that our data consists of the following bigrams, with their weights:
Word Alignment | That is, during the E step, we calculate the distribution of C(e, f) for each e and f, and during the M step, we train a language model on bigrams e f using expected KN smoothing (that is, with u = e and w = f). |
Word Alignment | (The latter case is equivalent to a backoff language model, where, since all bigrams are known, the lower-order model is never used.) |
Word Alignment | This is much less of a problem in KN smoothing, where p’ is estimated from bigram types rather than bigram tokens. |
Joint POS Tagging and Parsing with Nonlocal Features | Nonlocal feature templates, following (Collins and Koo, 2005) and (Charniak and Johnson, 2005): Rules, CoPar, HeadTree, Bigrams, CoLenPar, Grandparent Bigrams, Heavy, Lexical Bigrams, Neighbours.
Approach | In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue. |
Approach | Each section is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents or in more than 98% of the documents.
Approach | models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).
Experiment | Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010).7 To more closely match our models, LDA is given access to the same unigram and bigram tokens. |
Predicting Direction of Power | Accuracy (%):
  Baseline (Always Superior)           52.54
  Baseline (Word Unigrams + Bigrams)   68.56
  THRNew                               55.90
  THRPR                                54.30
  DIAPR                                54.05
  THRPR + THRNew                       61.49
  DIAPR + THRPR + THRNew               62.47
  LEX                                  70.74
  LEX + DIAPR + THRPR                  67.44
  LEX + DIAPR + THRPR + THRNew         68.56
  BEST (= LEX + THRNew)                73.03
  BEST (Using p1 features only)        72.08
  BEST (Using IMt features only)       72.11
  BEST (Using Mt only)                 71.27
  BEST (No Indicator Variables)        72.44
Predicting Direction of Power | We found the best setting to be using both unigrams and bigrams for all three types of ngrams, by tuning on our dev set.
Predicting Direction of Power | We also use a stronger baseline using word unigrams and bigrams as features, which obtained an accuracy of 68.6%. |
Experiments | In the case of bigrams, the perplexities of TheoryZ are almost the same as those of Zipf2 when the size of the reduced vocabulary is large.
Perplexity on Reduced Corpora | This model seems to be stupid, since we can easily notice that the bigram “is is” is quite frequent, and the two bigrams “is a” and “a is” have the same frequency.
Perplexity on Reduced Corpora | For example, the decay function g2 of bigrams is as follows:
Perplexity on Reduced Corpora | They pointed out that the exponent of bigrams is about 0.66, and that of 5-grams is about 0.59 in the Wall Street Journal corpus (WSJ 87). |
Experiments | All of the three smoothing methods for bigram and trigram LMs are examined both using back-off models
Pinyin Input Method Model | The edge weight is the negative logarithm of the conditional probability P(S_{j+1,k} | S_{j,i}) that a syllable S_{j,i} is followed by S_{j+1,k}, which is given by a bigram language model of pinyin syllables:
Pinyin Input Method Model | W_E(V_{j,i} → V_{j+1,k}) = −log P(V_{j+1,k} | V_{j,i}). Although the model is formulated as a first-order HMM, i.e., the LM used for the transition probability is a bigram one, it is easy to extend the model to take advantage of a higher-order n-gram LM, by tracking a longer history while traversing the graph.
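Decoding over such a syllable graph can be sketched as a Viterbi-style shortest path (a hypothetical implementation; the sentence-start symbol and the fallback probability in the usage example are assumptions):

```python
import math

def best_syllable_path(graph, bigram_prob):
    """Find the maximum-probability syllable sequence through a
    pinyin graph. graph is a list of candidate-syllable lists, one
    per position; bigram_prob maps a (previous, current) pair to
    P(current | previous). The edge weight is -log P(cur | prev),
    so the minimum-cost path is the most probable sequence."""
    dist = {s: -math.log(bigram_prob(("<s>", s))) for s in graph[0]}
    back = [{s: None for s in graph[0]}]
    for layer in graph[1:]:
        new_dist, new_back = {}, {}
        for v in layer:
            u = min(dist, key=lambda p: dist[p] - math.log(bigram_prob((p, v))))
            new_dist[v] = dist[u] - math.log(bigram_prob((u, v)))
            new_back[v] = u
        dist, back = new_dist, back + [new_back]
    node = min(dist, key=dist.get)  # best final syllable
    path = [node]
    for layer_back in reversed(back[1:]):
        path.append(layer_back[path[-1]])
    return list(reversed(path))

# Hypothetical toy model: "guo" is more likely than "gou" after "zhong".
probs = {("<s>", "zhong"): 1.0, ("zhong", "guo"): 0.9, ("zhong", "gou"): 0.1}
print(best_syllable_path([["zhong"], ["guo", "gou"]],
                         lambda ab: probs.get(ab, 1e-6)))  # → ['zhong', 'guo']
```

Extending to a higher-order n-gram LM, as the excerpt notes, amounts to making each state carry a longer history than a single previous syllable.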
Datasets | Their features come from the Linguistic Inquiry and Word Count lexicon (LIWC) (Pennebaker et al., 2001), as well as from lists of “sticky bigrams” (Brown et al., 1992) strongly associated with one party or another (e.g., “illegal aliens” implies conservative, “universal healthcare” implies liberal).
Datasets | We first extract the subset of sentences that contain any words in the LIWC categories of Negative Emotion, Positive Emotion, Causation, Anger, and Kill verbs.3 After computing a list of the top 100 sticky bigrams for each category, ranked by log-likelihood ratio, and selecting another subset from the original data that includes only sentences containing at least one sticky bigram, we take the union of the two subsets.
Related Work | They use an HMM-based model, defining the states as a set of fine-grained political ideologies, and rely on a closed set of lexical bigram features associated with each ideology, inferred from a manually labeled ideological books corpus. |