Distribution Prediction | For this purpose, we represent a word w using unigrams and bigrams that co-occur with w in a sentence as follows.
Distribution Prediction | Using a standard stop word list, we filter out frequent non-content unigrams and select the remainder as unigram features to represent a sentence. |
Distribution Prediction | Bigram features capture negations more accurately than unigrams, and have been found to be useful for sentiment classification tasks.
Previous Work | They learned unigram language models (LMs) for specific time periods and scored articles with log-likelihood ratio scores. |
Previous Work | Kanhabua and Norvag (2008; 2009) extended this approach with the same model, but expanded its unigrams with POS tags, collocations, and tf-idf scores. |
Previous Work | As above, they learned unigram LMs, but instead measured the KL-divergence between a document and a time period’s LM. |
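The KL-divergence scoring mentioned above can be sketched in a few lines. This is an illustrative toy (the function names, add-alpha smoothing choice, and miniature corpora are mine, not from the cited papers): a document is scored against a time period by the divergence between their smoothed unigram language models.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram LM over a fixed shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes p and q share the same support."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

doc = "the treaty was signed in geneva".split()
period = "the treaty was signed the war ended in geneva".split()
vocab = set(doc) | set(period)
p_doc = unigram_lm(doc, vocab)
p_period = unigram_lm(period, vocab)
score = kl_divergence(p_doc, p_period)  # lower = document fits the period better
```

A document would be assigned to the time period whose LM minimizes this divergence.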
Timestamp Classifiers | The unigrams w are lowercased tokens. |
Timestamp Classifiers | model as the Unigram NLLR. |
Timestamp Classifiers | Followup work by Kanhabua and Norvag (2008) applied two filtering techniques to the unigrams in the model: |
Experimental Setup | 256,873 unique unigrams and 4,494,222 unique bigrams. |
Experimental Setup | We cluster unigrams (i = 1) and bigrams (i = 2).
Experimental Setup | For all experiments, |l31| = |l32| (except in cases where |l32| exceeds the number of unigrams; see below).
Models | The parameters d′, d′′, and d′′′ are the discounts for unigrams, bigrams, and trigrams, respectively, as defined by Chen and Goodman (1996, p. 20, (26)).
Models | 232) is the set of unigram (resp. |
Models | We cluster bigram histories and unigram histories separately and write p_B(w3 | w1 w2) for the bigram cluster model and p_B(w3 | w2) for the unigram cluster model.
Related work | symbol | denotation: Σ_w (sum over all unigrams w)
Abstract | We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. |
Introduction | The unigram model we describe below corresponds most closely to the Frank |
Introduction | 2.1 Topic models and the unigram PCFG |
Introduction | This leads to our first model, the unigram grammar, which is a PCFG.1 |
Experiments 4.1 The data | Best performance for both the Unigram and the Bigram model in the GOLD-p condition is achieved under the left-right setting, in line with the standard analyses of /t/-deletion as primarily being determined by the preceding and the following context.
Experiments 4.1 The data | For the LEARN-p condition, the Bigram model still performs best in the left-right setting but the Unigram model’s performance drops |
Experiments 4.1 The data | Unigram |
Introduction | We find that models that capture bigram dependencies between underlying forms provide considerably more accurate estimates of those probabilities than corresponding unigram or “bag of words” models of underlying forms. |
The computational model | Our models build on the Unigram and the Bigram model introduced in Goldwater et al. |
The computational model | Figure 1 shows the graphical model for our joint Bigram model (the Unigram case is trivially recovered by generating the U_{i,j}s directly from L rather than from L_{U_{i,j−1}}).
A Motivating Example | (unigrams) development, civilization
Feature Expansion | , w_N}, where the elements w_i are either unigrams or bigrams that appear in the review d. We then represent a review d by a real-valued term-frequency vector d ∈ R^N, where the value of the j-th element d_j is set to the total number of occurrences of the unigram or bigram w_j in the review d. To find the suitable candidates to expand a vector d for the review d, we define a ranking score score(u_i, d) for each base entry in the thesaurus as follows:
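The term-frequency vector described here is easy to make concrete. A minimal sketch (the feature index, tokenization, and example review are invented for illustration; the paper's actual feature set comes from its training corpus):

```python
from collections import Counter

def lexical_elements(review_tokens):
    """Unigrams plus adjacent-word bigrams extracted from a review."""
    unigrams = list(review_tokens)
    bigrams = [f"{a} {b}" for a, b in zip(review_tokens, review_tokens[1:])]
    return unigrams + bigrams

def tf_vector(review_tokens, feature_index):
    """d_j = total number of occurrences of feature w_j in the review."""
    counts = Counter(lexical_elements(review_tokens))
    return [counts[w] for w in feature_index]

review = "very good camera very good lens".split()
index = ["good", "very good", "lens", "bad"]   # hypothetical feature index
vec = tf_vector(review, index)
```

Each position of `vec` is aligned with one unigram or bigram in `index`, matching the d_j definition above.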
Feature Expansion | Moreover, we weight the relatedness scores for each word wj by its normalized term-frequency to emphasize the salient unigrams and bigrams in a review. |
Feature Expansion | This is particularly important because we would like to score base entries ui considering all the unigrams and bigrams that appear in a review d, instead of considering each unigram or bigram individually. |
Introduction | a unigram or a bigram of word lemma) in a review using a feature vector. |
Sentiment Sensitive Thesaurus | We select unigrams and bigrams from each sentence. |
Sentiment Sensitive Thesaurus | For the remainder of this paper, we will refer to unigrams and bigrams collectively as lexical elements. |
Sentiment Sensitive Thesaurus | Previous work on sentiment classification has shown that both unigrams and bigrams are useful for training a sentiment classifier (Blitzer et al., 2007). |
Experiments | Besides the heuristic baseline, we tried our model-based approach using Unigrams, Bigrams, and Anchored Unigrams, with and without learning the parametric edit distances.
Learning | To find this maximizer for any given parameter setting, we need to find a marginal distribution over the edges connecting any two languages a and d. With this distribution, we calculate the expected “alignment unigrams.” That is, for each pair of phonemes x and y (or the empty phoneme ε), we need to find the quantity:
Message Approximation | In the context of transducers, previous authors have focused on a combination of n-best lists and unigram back-off models (Dreyer and Eisner, 2009), a schematic diagram of which is in Figure 2(d). |
Message Approximation | Figure 2: Various topologies for approximating messages: (a) a unigram model, (b) a bigram model, (c) the anchored unigram model, and (d) the n-best plus backoff model used in Dreyer and Eisner (2009).
Message Approximation | Another is to choose τ(w) to be a unigram language model over the language in question with a geometric probability over lengths.
Experiments and Discussions | We use R-1 (recall against unigrams), R-2 (recall against bigrams), and R-SU4 (recall against skip-4 bigrams).
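The R-1 and R-2 recall measures mentioned here can be sketched directly (a simplified single-reference version without ROUGE's stemming, stop-word options, or skip-bigrams; the example sentences are invented):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    """Clipped n-gram recall of a candidate summary against one reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
r1 = rouge_n_recall(cand, ref, 1)  # R-1: unigram recall
r2 = rouge_n_recall(cand, ref, 2)  # R-2: bigram recall
```

As the excerpts note, R-2 is stricter than R-1: a single substituted word breaks two bigrams but only one unigram.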
Experiments and Discussions | Note that R-2 is a measure of bigram recall, and the sumHLDA of HybHSum2 is built on unigrams rather than bigrams.
Regression Model | (I) nGram Meta-Features (NMF): For each document cluster D, we identify the most frequent (non-stop-word) unigrams, i.e., v_freq = {w_i}_{i=1}^r ⊂ V, where r is a model parameter giving the number of most frequent unigram features.
Regression Model | We measure observed unigram probabilities for each w_i ∈ v_freq with p_D(w_i) = n_D(w_i) / Σ_{j=1}^{|V|} n_D(w_j), where n_D(w_i) is the number of times w_i appears in D and |V| is the total number of unigrams.
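The p_D(w_i) computation above is a straightforward normalized count restricted to the r most frequent unigrams. A small sketch (function name, stop-word handling, and toy cluster are illustrative):

```python
from collections import Counter

def frequent_unigram_probs(cluster_docs, r, stopwords=frozenset()):
    """p_D(w_i) = n_D(w_i) / sum_j n_D(w_j), reported for the r most
    frequent non-stop-word unigrams in document cluster D."""
    counts = Counter(w for doc in cluster_docs for w in doc
                     if w not in stopwords)
    total = sum(counts.values())      # denominator runs over all unigrams in D
    v_freq = [w for w, _ in counts.most_common(r)]
    return {w: counts[w] / total for w in v_freq}

docs = [["budget", "vote", "budget"], ["vote", "tax", "budget"]]
probs = frequent_unigram_probs(docs, r=2)
```

Note the denominator sums over the whole vocabulary of D, so the returned values do not sum to 1 in general.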
Regression Model | To characterize this feature, we reuse the r most frequent unigrams, i.e., w_i ∈ v_freq.
Tree-Based Sentence Scoring | * sparse unigram distributions (sim_l) at each topic l on c_om: similarity between p(w_om,l | z_om = l, c_om, v_l) and p(w_sn,l | z_sn = l, c_om, v_l)
Tree-Based Sentence Scoring | — sim_l: We define two sparse (discrete) unigram distributions for candidate o_m and summary s_n at each node l on a vocabulary identified with words generated by the topic at that node, v_l ⊂ V. Given w_om = {w_1, ..., w_|om|}, let w_om,l ⊂ w_om be the set of words in o_m that are generated from topic z_om at level l on path c_om.
Tree-Based Sentence Scoring | The discrete unigram distribution p_om,l = p(w_om,l | z_om = l, c_om, v_l) represents the probability over all words v_l assigned to topic z_om at level l, by sampling only for words in w_om,l.
Abstract | The evaluation of computer-generated text is a notoriously difficult problem; however, the quality of image descriptions has typically been measured using unigram BLEU and human judgements.
Abstract | We estimate the correlation of unigram and Smoothed BLEU, TER, ROUGE-SU4, and Meteor against human judgements on two data sets. |
Abstract | The main finding is that unigram BLEU has a weak correlation, and Meteor has the strongest correlation with human judgements. |
Introduction | The main finding of our analysis is that TER and unigram BLEU are weakly correlated
Methodology | Unigram BLEU without a brevity penalty has been reported by Kulkarni et al.
Methodology | (2011) to perform a sentence-level analysis, setting n = 1 and no brevity penalty to get the unigram BLEU measure, or n = 4 with the brevity penalty to get the Smoothed BLEU measure. |
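The n = 1, no-brevity-penalty setting described here reduces BLEU to clipped unigram precision, which can be sketched in a few lines (a simplified single-reference version; the example sentences are invented, and real BLEU implementations add smoothing and multi-reference clipping):

```python
from collections import Counter

def unigram_bleu(candidate, reference):
    """Clipped unigram precision; no brevity penalty, per the n = 1 setting."""
    cand = Counter(candidate)
    ref = Counter(reference)
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(len(candidate), 1)

ref = "a dog sleeps on the sofa".split()
cand = "a dog is on the sofa".split()
score = unigram_bleu(cand, ref)
```

Clipping means a candidate cannot gain credit by repeating a reference word more often than it appears in the reference.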
Methodology | We set dskip = 4 and award partial credit for unigram only matches, otherwise known as ROUGE-SU4. |
Introduction | In Section 3, we theoretically derive the tradeoff formulae of the cutoff for unigram models, k-gram models, and topic models, each of which represents its perplexity with respect to a reduced vocabulary, under the assumption that the corpus follows Zipf’s law. |
Perplexity on Reduced Corpora | 3.1 Perplexity of Unigram Models |
Perplexity on Reduced Corpora | Let us consider the perplexity of a unigram model learned from a reduced corpus. |
Perplexity on Reduced Corpora | In unigram models, a predictive distribution p′ on a reduced corpus w′ can simply be calculated as p′(w′) = f(w′)/N′.
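The perplexity of this maximum-likelihood unigram model is easy to compute explicitly (a minimal sketch with an invented toy corpus; the paper's analysis concerns Zipf-distributed corpora, which this does not reproduce):

```python
import math
from collections import Counter

def unigram_perplexity(corpus):
    """Perplexity of the ML unigram model p'(w) = f(w)/N,
    evaluated on the same corpus it was estimated from."""
    counts = Counter(corpus)
    n = len(corpus)
    log_prob = sum(c * math.log(c / n) for c in counts.values())
    return math.exp(-log_prob / n)

reduced = ["a", "a", "b", "b"]      # uniform over two word types
pp = unigram_perplexity(reduced)    # perplexity 2 for a uniform 2-type corpus
```

A corpus uniform over k types has perplexity exactly k, which is a handy sanity check when studying how vocabulary reduction changes perplexity.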
Automatic Evaluation Metrics | Then, unigram matching is performed on the remaining words that are not matched using paraphrases. |
Automatic Evaluation Metrics | Based on the matches, ParaEval will then elect to use either unigram precision or unigram recall as its score for the sentence pair. |
Automatic Evaluation Metrics | Based on the number of word or unigram matches and the amount of string fragmentation represented by the alignment, METEOR calculates a score for the pair of strings. |
Experiments | Consistent with findings in the literature (Cui et al., 2006; Dave et al., 2003; Gamon and Aue, 2005), on the large corpus of movie review texts, the in-domain-trained system based solely on unigrams had lower accuracy than the similar system trained on bigrams. |
Experiments | On sentences, however, we have observed an inverse pattern: unigrams performed better than bigrams and trigrams. |
Experiments | Due to the lower frequency of higher-order n-grams (as opposed to unigrams), higher-order n-gram language models are sparser, which increases the probability of missing a particular sentiment marker in a sentence (Table 3).
Factors Affecting System Performance | System runs with unigrams, bigrams, and trigrams as features and with different training set sizes are presented.
Lexicon-Based Approach | One of the limitations of general lexicons and dictionaries, such as WordNet (Fellbaum, 1998), as training sets for sentiment tagging systems is that they contain only definitions of individual words and, hence, only unigrams could be effectively learned from dictionary entries. |
Lexicon-Based Approach | Since the structure of WordNet glosses is fairly different from that of other types of corpora, we developed a system that used the list of human-annotated adjectives from (Hatzivassiloglou and McKeown, 1997) as a seed list and then learned additional unigrams |
Word segmentation with adaptor grammars | Figure 1: The unigram word adaptor grammar, which uses a unigram model to generate a sequence of words, where each word is a sequence of phonemes.
Word segmentation with adaptor grammars | 3.1 Unigram word adaptor grammar |
Word segmentation with adaptor grammars | (2007a) presented an adaptor grammar that defines a unigram model of word segmentation and showed that it performs as well as the unigram DP word segmentation model presented by Goldwater et al. (2006a).
Introduction | twitter unigram TTT * YES (54%) twitter bigram TTT * YES (52%) personal unigram MT * YES (52%) personal bigram — NO (48%)
Introduction | We measure a tweet’s similarity to expectations by its score according to the relevant language model, (1/|T|) Σ_{w∈T} log p(w), where T refers to either all the unigrams (unigram model) or all and only bigrams (bigram model). We trained a Twitter-community language model from our 558M unpaired tweets, and personal language models from each author’s tweet history.
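The length-normalized log-probability score described here can be sketched as follows (a toy add-one-smoothed unigram model over an invented "community" corpus; the paper's 558M-tweet models are obviously far larger):

```python
import math
from collections import Counter

def train_unigram_lm(corpus_tokens, alpha=1.0):
    """Add-alpha smoothed unigram LM; unseen words share one <unk> bucket."""
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1              # +1 for the <unk> bucket
    total = len(corpus_tokens) + alpha * vocab_size
    return lambda w: (counts[w] + alpha) / total

def lm_score(tokens, lm):
    """(1/|T|) * sum_{w in T} log p(w): average log-probability of the tweet."""
    return sum(math.log(lm(w)) for w in tokens) / len(tokens)

community = "good morning everyone good luck today".split()
lm = train_unigram_lm(community)
typical = lm_score("good morning".split(), lm)
atypical = lm_score("quantum chromodynamics".split(), lm)
```

A tweet made of community-typical words scores higher (less negative) than one made of words the model has never seen.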
Introduction | headline unigram TT YES (53%) headline bigram TTTT * YES (52%) |
Related Work | We learn embeddings for unigrams, bigrams, and trigrams separately with the same neural network and the same parameter settings.
Related Work | The contexts of unigram (bigram/trigram) are the surrounding unigrams (bigrams/trigrams), respectively. |
Related Work | 25$ employs the embeddings of unigrams, bigrams, and trigrams separately and conducts the matrix-vector operation of ac on the sequence represented by columns in each lookup table.
Model | For notational convenience, we use terms to denote both words (unigrams) and phrases (n-grams).
Phrase Ranking based on Relevance | Topics in most topic models like LDA are usually unigram distributions. |
Phrase Ranking based on Relevance | For each word, a topic is sampled first, then its status as a unigram or bigram is sampled, and finally the word is sampled from a topic-specific unigram or bigram distribution. |
Phrase Ranking based on Relevance | Yet another thread of research post-processes the discovered topical unigrams to form multi-word phrases using likelihood scores (Blei and Lafferty, 2009). |
Background | This work differs from previous Bayesian models in that we explicitly model a complex backoff path using a hierarchical prior, such that our model jointly infers distributions over tag trigrams, bigrams, and unigrams, as well as whole words and their character-level representation.
Experiments | Note that the bigram PYP-HMM outperforms the closely related BHMM (the main difference being that we smooth tag bigrams with unigrams).
The PYP-HMM | The trigram transition distribution, Tij, is drawn from a hierarchical PYP prior which backs off to a bigram Bj and then a unigram U distribution, |
The PYP-HMM | This allows the modelling of trigram tag sequences, while smoothing these estimates with their corresponding bigram and unigram distributions. |
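The hierarchical PYP backoff itself needs a full sampler, but the idea of smoothing trigram tag estimates with bigram and unigram distributions can be illustrated with plain linear interpolation. This is a stand-in sketch, not the paper's model; the class name, fixed weights, and toy tag sequence are all invented:

```python
from collections import Counter

class InterpolatedTagModel:
    """Toy interpolation of trigram, bigram, and unigram tag estimates
    (a simple stand-in for the hierarchical PYP backoff)."""
    def __init__(self, tag_seqs, l3=0.6, l2=0.3, l1=0.1):
        self.l3, self.l2, self.l1 = l3, l2, l1
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for tags in tag_seqs:
            self.uni.update(tags)
            self.bi.update(zip(tags, tags[1:]))
            self.tri.update(zip(tags, tags[1:], tags[2:]))
        self.n = sum(self.uni.values())

    def prob(self, t, hist):
        u, v = hist   # the two preceding tags
        p1 = self.uni[t] / self.n
        p2 = self.bi[(v, t)] / self.uni[v] if self.uni[v] else p1
        p3 = self.tri[(u, v, t)] / self.bi[(u, v)] if self.bi[(u, v)] else p2
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

m = InterpolatedTagModel([["DT", "NN", "VB", "DT", "NN"]])
p = m.prob("NN", ("VB", "DT"))
```

Unseen trigram histories fall back to the bigram estimate, and unseen bigram histories to the unigram estimate, mirroring the backoff path described above; the PYP prior replaces the fixed weights with learned, count-dependent discounting.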
The PYP-HMM | That is, each table at one level is equivalent to a customer at the next deeper level, creating the invariants K_{hw} = n_{uw}, where u = t_{l−1} indicates the unigram backoff context of h. The recursion terminates at the lowest level, where the base distribution is static.
Experiments | We train the OVR classifier on three sets of features: LIWC, Unigram, and POS.
Experiments | In particular, the three-class classifier is around 65% accurate at distinguishing between Employee, Customer, and Turker for each of the domains using Unigram features, significantly higher than random guessing.
Experiments | Best performance is achieved with Unigram features, consistently outperforming LIWC and POS features in both three-class and two-class settings in the hotel domain.
Introduction | In the examples in Table l, we trained a linear SVM classifier on Ott’s Chicago-hotel dataset on unigram features and tested it on a couple of different domains (the details of data acquisition are illustrated in Section 3). |
Introduction | Table 1: SVM performance on datasets for a classifier trained on Chicago hotel review based on Unigram feature. |
Definitions | Given a ciphertext f_1^N, we define the unigram count N_f of f ∈ V_f as
Definitions | Similarly, we define language model matrices S for the unigram and the bigram case. |
Definitions | The unigram language model Sf is defined as |
Introduction | In Section 4 we show that decipherment using a unigram language model corresponds to solving a linear sum assignment problem (LSAP). |
Conclusion | However, oovs can be considered as n-grams (phrases) instead of unigrams . |
Conclusion | In this scenario, we also can look for paraphrases and translations for phrases containing oovs and add them to the phrase-table as new translations along with the translations for unigram oovs. |
Experiments & Results 4.1 Experimental Setup | Table 4: Intrinsic results of different types of graphs when using unigram nodes on Europarl. |
Experiments & Results 4.1 Experimental Setup |
Type        Node      MRR %   RCL %
Bipartite   unigram   5.2     12.5
            bigram    6.8     15.7
Tripartite  unigram   5.9     12.6
            bigram    6.9     15.9
Baseline    bigram    3.9     7.7
Experiments & Results 4.1 Experimental Setup | Table 5: Results on using unigram or bigram nodes. |
Experiments | The baselines NB and BINB are Naive Bayes classifiers with, respectively, unigram features and unigram and bigram features. |
Experiments | SVM is a support vector machine with unigram and bigram features. |
Experiments | unigram, POS, head chunks 91.0
Introduction | On the hand-labelled test set, the network achieves a greater than 25% reduction in the prediction error with respect to the strongest unigram and bigram baseline reported in Go et al.
Reranking Features | Long-range Unigram.
Reranking Features | in the parse tree: f(L2 → left) = 1 and f(L4 → turn) = 1. Two-level Long-range Unigram.
Reranking Features | Unigram.
Abstract | Previous work in traditional text classification and its variants — such as sentiment analysis — has achieved successful results by using the bag-of-words representation; that is, by treating text as a collection of words with no interdependencies, training a classifier on a large feature set of word unigrams which appear in the corpus. |
Abstract | Few of these tactics would be effectively encapsulated by word unigrams . |
Abstract | Many would be better modeled by POS tag unigrams (with no word information) or by longer n-grams consisting of either words, POS tags, or a combination of the two. |
Complexity Analysis | For the monolingual bigram model, the number of states in the HMM is U times that of the monolingual unigram model, as the states at a specific position of F are related not only to the length of the current word but also to the length of the word before it.
Complexity Analysis | Thus its complexity is U^2 times the unigram model’s complexity:
Complexity Analysis | (unigram) 0.729 0.804 3 s 50 s Prop.
Methods | This section uses a unigram model for description convenience, but the method can be extended to n-gram models. |
Term and Document Frequency Statistics | Figure 4: Difference between observed and predicted IDFw for Tagalog unigrams.
Term and Document Frequency Statistics | 3.1 Unigram Probabilities |
Term and Document Frequency Statistics | We encounter the burstiness property of words again by looking at unigram occurrence probabilities. |
Smoothing on count distributions | For the example above, the estimates for the unigram model p′(w) are p′(cat) ≈ 0.489 and p′(dog) ≈ 0.511.
Smoothing on count distributions | For the example above, the count distributions used for the unigram distribution would be:
                 r = 0   r = 1
p(c(cat) = r)    0.14    0.86
p(c(dog) = r)    0.10    0.90
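The cat/dog figures in these excerpts follow from taking expected counts under the count distributions and normalizing. A minimal sketch reproducing the arithmetic (the function names are mine; the numbers are the ones given above):

```python
def expected_count(count_dist):
    """E[c(w)] under a distribution over integral counts {r: p(c(w) = r)}."""
    return sum(r * p for r, p in count_dist.items())

count_dists = {
    "cat": {0: 0.14, 1: 0.86},
    "dog": {0: 0.10, 1: 0.90},
}
expected = {w: expected_count(d) for w, d in count_dists.items()}
total = sum(expected.values())                      # 0.86 + 0.90 = 1.76
p_unigram = {w: e / total for w, e in expected.items()}
```

Normalizing the expected counts 0.86 and 0.90 by their sum 1.76 yields the estimates quoted above.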
Smoothing on integral counts | Absolute discounting chooses p′(w | u′) to be the maximum-likelihood unigram distribution; under KN smoothing (Kneser and Ney, 1995), it is chosen to make p in (2) satisfy the following constraint for all (n − 1)-grams u′w:
Word Alignment | Following common practice in language modeling, we use the unigram distribution p( f ) as the lower-order distribution. |
Word Alignment | As shown in Table 1, for KN smoothing, interpolation with the unigram distribution performs the best, while for WB smoothing, interestingly, interpolation with the uniform distribution performs the best.
Word Alignment | In WB smoothing, p’( f) is the empirical unigram distribution. |
Evaluation | In our first set of experiments, we looked at the impact of choosing bigrams over unigrams as our basic unit of representation, along with performance of LP (Eq. |
Evaluation | Using unigrams (“SLP l-gram”) actually does worse than the baseline, indicating the importance of focusing on translations for sparser bigrams. |
Evaluation | It is relatively straightforward to combine both unigrams and bigrams in one source graph, but for experimental clarity we did not mix these phrase lengths.
Generation & Propagation | Although our technique applies to phrases of any length, in this work we concentrate on unigram and bigram phrases, which provides substantial computational cost savings. |
Introduction | Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams , since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text. |
Related Work | Recent improvements to BLI (Tamura et al., 2012; Irvine and Callison-Burch, 2013b) have contained a graph-based flavor by presenting label propagation-based approaches using a seed lexicon, but evaluation is once again done on top-1 or top-3 accuracy, and the focus is on unigrams . |
Related Work | (2013) and Irvine and Callison-Burch (2013a) conduct a more extensive evaluation of their graph-based BLI techniques, where the emphasis and end-to-end BLEU evaluations concentrated on OOVs, i.e., unigrams , and not on enriching the entire translation model. |
Inference | While previous work used p(k|θ) = (1 − p($))^{k−1} p($), this is only true for unigrams.
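The geometric length distribution in this formula is simple to write out: a word of length k is k − 1 non-terminating characters followed by the end symbol $. A small sketch (the function name and the value of p($) are illustrative):

```python
def geometric_length_pmf(k, p_end):
    """p(k) = (1 - p($))^(k-1) * p($): probability of word length k when the
    end symbol $ is emitted with probability p($) after each character."""
    return (1.0 - p_end) ** (k - 1) * p_end

p_end = 0.25
pmf = [geometric_length_pmf(k, p_end) for k in range(1, 200)]
```

The mass decays geometrically in k and sums to 1 over all positive lengths, which is exactly why it only holds for unigram (single-word) generation.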
Inference | the number of tables tew for w in word unigrams . |
Nested Pitman-Yor Language Model | Thus far we have assumed that the unigram G1 is already given, but of course it should also be generated as G1 ∼ PY(G0, d, θ).
Nested Pitman-Yor Language Model | Note that this is different from unigrams, which are a posterior distribution given the data.
Nested Pitman-Yor Language Model | When a word w is generated from its parent at the unigram node, it means that w
Pitman-Yor process and n-gram models | Suppose we have a unigram word distribution G1 = {p(·)}, where · ranges over each word in the lexicon.
Pitman-Yor process and n-gram models | In this representation, each n-gram context h (including the null context ε for unigrams) is a Chinese restaurant whose customers are the n-gram counts seated over the tables 1 ··· t_hw.
Approach | (a) Word unigrams (b) Word bigrams |
Approach | (a) PoS unigrams (b) PoS bigrams (c) PoS trigrams |
Approach | Word unigrams and bigrams are lower-cased and used in their inflected forms. |
Previous work | The Bayesian Essay Test Scoring sYstem (BETSY) (Rudner and Liang, 2002) uses multinomial or Bernoulli Naive Bayes models to classify texts into different classes (e.g. pass/fail, grades A–F) based on content and style features such as word unigrams and bigrams, sentence length, number of verbs, noun–verb pairs, etc.
Validity tests | (a) word unigrams within a sentence (b) word bigrams within a sentence (c) word trigrams within a sentence |
Conclusions | Our finding that token unigram features are capable of solving the task accurately agrees with the results of previous works on hedge classification ((Light et al., 2004), (Med-
Methods | For trigrams, bigrams, and unigrams (processed separately), we calculated a new class-conditional probability for each feature c, discarding those observations of c in speculative instances where c was not among the two highest ranked candidates.
Results | About half of these were the kind of phrases that had no unigram components of themselves in the feature set, so these could be regarded as meaningful standalone features. |
Results | Our model using just unigram features achieved a BEP(spec) score of 78.68% and an Fβ=1(spec) score of 80.23%, which means that using bigram and trigram hedge cues here significantly improved the performance (the differences in BEP(spec) and Fβ=1(spec) scores were 5.23% and 4.97%, respectively).
Results | Our experiments revealed that in radiology reports, which mainly concentrate on listing the identified diseases and symptoms (facts) and the physician’s impressions (speculative parts), detecting hedge instances can be performed accurately using unigram features. |
Data and Task | Notice the high lexical overlap between the two sentences (unigram overlap of 100% in one direction and 72% in the other).
Data and Task | 19 is another true paraphrase pair with much lower lexical overlap (unigram overlap of 50% in one direction and 30% in the other).
Experimental Evaluation | (2006), using features calculated directly from s1 and s2 without recourse to any hidden structure: proportion of word unigram matches, proportion of lemmatized unigram matches, BLEU score (Papineni et al., 2001), BLEU score on lemmatized tokens, F measure (Turian et al., 2003), difference of sentence length, and proportion of dependency relation overlap.
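The directional unigram-overlap feature used in these excerpts can be sketched directly (type-level overlap over an invented sentence pair; the papers do not specify whether they count tokens or types, so this is one plausible reading):

```python
def unigram_overlap(s1, s2):
    """Proportion of s1's word types that also appear in s2 (directional)."""
    t1, t2 = set(s1), set(s2)
    return len(t1 & t2) / len(t1) if t1 else 0.0

a = "the senate passed the bill on friday".split()
b = "the bill was passed by the senate".split()
forward = unigram_overlap(a, b)    # overlap of a's words with b
backward = unigram_overlap(b, a)   # overlap of b's words with a
```

Because the measure is directional, the two values generally differ, which is why the excerpts above report overlap "in one direction" and "in the other".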
Experimental Evaluation | This is accomplished by eliminating lines 12 and 13 from the definition of pm and redefining pword to be the unigram word distribution estimated from the Gigaword corpus, as in G0, without the help of WordNet.
QG for Paraphrase Modeling | (15) Here a_w is the Good-Turing unigram probability estimate of a word w from the Gigaword corpus (Graff, 2003).
QG for Paraphrase Modeling | As noted, the distributions pm, the word unigram weights in Eq. |
System Architecture | To derive word features, our system first automatically collects a list of word unigrams and bigrams from the training data.
System Architecture | To avoid overfitting, we only collect the word unigrams and bigrams whose frequency is larger than 2 in the training set. |
System Architecture | This list of word unigrams and bigrams is then used as a unigram dictionary and a bigram dictionary to generate word-based unigram and bigram features.
Experiments | This is because the weights of unigram to trigram features in a log-linear CRF model are balanced as a consequence of maximization.
Experiments | A unigram feature might end up with lower weight because another trigram containing this unigram gets a higher weight. |
Experiments | Then we would have missed this feature if we only used top unigram features. |
Method | Unigram QA Model The QA system uses up to trigram features (Table 1 shows examples of unigram and bigram features). |
Method | We drop this strict constraint (which may need further smoothing) and only use unigram features, not by simply extracting “good” unigram features from the trained model, but by retraining the model with only unigram features. |
Algorithms | When constructing a summary, we update the unigram distribution of the constructed summary so that it includes a smoothed distribution of the previous summaries in order to eliminate redundancy between the successive steps in the chain. |
Algorithms | For example, when we summarize the documents that were retrieved as a result to the first query, we calculate the unigram distribution in the same manner as we did in Focused KLSum; but for the second query, we calculate the unigram distribution as if all the sentences we selected for the previous summary were selected for the current query too, with a damping factor. |
Algorithms | In this variant, the Unigram Distribution estimate of word X is computed as: |
Previous Work | KLSum adopts a language model approach to compute relevance: the documents in the input set are modeled as a distribution over words (the original algorithm uses a unigram distribution over the bag of words in documents D). |
Previous Work | KLSum is a sentence extraction algorithm: it searches for a subset of the sentences in D with a unigram distribution as similar as possible to that of the overall collection D, but with a limited length. |
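The KLSum selection described here is usually implemented greedily: at each step, add the sentence that brings the summary's unigram distribution closest to the document distribution. A compact sketch (greedy search with epsilon smoothing and a toy sentence set, all invented for illustration; the original uses a word-budget constraint over a real document collection):

```python
import math
from collections import Counter

def dist(tokens):
    c = Counter(tokens)
    n = len(tokens)
    return {w: k / n for w, k in c.items()}

def kl(p, q, vocab, eps=1e-9):
    """KL(p || q) with epsilon smoothing for words missing from q."""
    return sum(p.get(w, 0) * math.log((p.get(w, 0) + eps) / (q.get(w, 0) + eps))
               for w in vocab)

def klsum(sentences, max_words):
    """Greedily add the sentence whose inclusion minimizes KL(doc || summary)."""
    doc = dist([w for s in sentences for w in s])
    vocab = set(doc)
    summary, pool = [], list(sentences)
    while pool and sum(map(len, summary)) < max_words:
        best = min(pool, key=lambda s: kl(
            doc, dist([w for t in summary + [s] for w in t]), vocab))
        summary.append(best)
        pool.remove(best)
    return summary

sents = [["stocks", "fell", "sharply"],
         ["stocks", "fell"],
         ["weather", "was", "mild"]]
chosen = klsum(sents, max_words=3)
```

The sentence covering the most document probability mass is picked first, which is the behavior the redundancy-aware variants above then modify by folding previous summaries into the target distribution.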
Previous Work | After the words are classified, the algorithm uses a KLSum variant to find the summary that best matches the unigram distribution of topic specific words. |
Evaluation | The remaining 93 unlabeled games are used to train unigram, bigram, and trigram grounded language models.
Evaluation | Only unigrams, bigrams, and trigrams that are not proper names, appear more than three times, and are not composed only of stop words were used.
Evaluation | with traditional unigram, bigram, and trigram language models generated from a combination of the closed captioning transcripts of all training games and data from the Switchboard corpus (see below).
Linguistic Mapping | In the discussion that follows, we describe a method for estimating unigram grounded language models.
Approach | 0,; is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents. |
Approach | models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).
Evaluation | We derived unigram tfidf vectors for each section in each of 50 randomly sampled policies per category. |
Experiment | The implementation uses unigram features and cosine similarity. |
Experiment | Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference (Hoffman et al., 2010). To more closely match our models, LDA is given access to the same unigram and bigram tokens.
Experiments | Besides unigram and bigram, the most effective textual feature is URL. |
Proposed Features | 3.1.1 Unigrams and Bigrams The most common type of feature for text classification
Proposed Features | feature selection method χ² (Yang and Pedersen, 1997) to select the top 200 unigrams and bigrams as features.
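The χ² selection used here scores each candidate unigram or bigram by its association with the class label via a 2×2 contingency table. A minimal sketch (the contingency counts and terms are invented; real pipelines compute the tables from document frequencies):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one term/class table:
    n11 = in-class docs containing the term, n10 = in-class docs without it,
    n01 = out-of-class docs with the term, n00 = out-of-class docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def top_k_features(tables, k):
    """Rank candidate unigrams/bigrams by chi-square and keep the top k."""
    ranked = sorted(tables, key=lambda t: chi_square(*tables[t]), reverse=True)
    return ranked[:k]

tables = {
    "guarantee": (40, 10, 5, 45),   # skewed toward one class
    "the":       (25, 25, 25, 25),  # independent of the class
}
best = top_k_features(tables, 1)
```

Terms distributed independently of the class score 0 and are dropped; selecting the 200 highest-scoring terms mirrors the setup described above.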
Proposed Features | The top ten unigrams related to deceptive answers are shown on Table 1. |
Experimental results | Note that unigrams in the models are never pruned, hence all models assign probabilities over an identical vocabulary and perplexity is comparable across models. |
Marginal distribution constraints | Thus the unigram distribution is with respect to the bigram model, the bigram model is with respect to the trigram model, and so forth. |
Model constraint algorithm | Thus we process each history length in descending order, finishing with the unigram state. |
Model constraint algorithm | This can be seen particularly clearly at the unigram state, which has an arc for every unigram (the size of the vocabulary): for every bigram state (also on the order of the vocabulary), in the naive algorithm we must look for every possible arc.
Experiments | In particular, we use the unigrams of the current and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, all-number, punctuation, and tag bigrams for the POS, CoNLL 2000, and CoNLL 2003 datasets.
Experiments | For the supertag dataset, we use the same features for the word inputs, and unigrams and bigrams for the gold POS inputs.
Problem formulation | To simplify the discussion, we divide the features into two groups: unigram label features and bigram label features.
Problem formulation | Unigram features are of the form f_k(y_t, x_t); they involve the current label and arbitrary feature patterns from the input sequence.
Extensions of SemPOS | For the purposes of the combination, we compute BLEU only on unigrams up to four-grams (denoted BLEU1, ..., BLEU4) but including the brevity penalty as usual.
Extensions of SemPOS | This is also confirmed by the observation that using BLEU alone is rather unreliable for Czech and BLEU-l (which judges unigrams only) is even worse. |
Problems of BLEU | Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams and far fewer higher n-grams. |
Problems of BLEU | This amounts to 34% of running unigrams , giving enough space to differ in human judgments and still remain unscored. |
Approaches | We consider both template unigrams and bigrams, combining two templates in sequence. |
Approaches | Constructing all feature template unigrams and bigrams would yield an unwieldy number of features. |
Experiments | Our primary feature set IGC consists of 127 template unigrams that emphasize coarse properties (i.e., properties 7, 9, and 11 in Table 1). |
Experiments | However, the original unigram Bjorkelund features (Bdeflmemh), which were tuned for a high-resource model, obtain higher F1 than our information gain set using the same features in unigram and bigram templates (IGB).
Conclusions and Future Work | We model and harness lexical correlations using translation models, together with unigram language models that characterize reply posts, and formulate a clustering-based EM approach for solution identification.
Introduction | We model the lexical correlation and solution post character using regularized translation models and unigram language models respectively. |
Our Approach | Consider a unigram language model S that models the lexical characteristics of solution posts, and a translation model T that models the lexical correlation between problems and solutions.
Our Approach | Consider the post and reply vocabularies to be of sizes A and B respectively; then, the translation model would have A × B variables, whereas the unigram language model has only B variables.
Predicting Direction of Power | Baseline (Always Superior) 52.54; Baseline (Word Unigrams + Bigrams) 68.56; THRNew 55.90; THRPR 54.30; DIAPR 54.05; THRPR + THRNew 61.49; DIAPR + THRPR + THRNew 62.47; LEX 70.74; LEX + DIAPR + THRPR 67.44; LEX + DIAPR + THRPR + THRNew 68.56; BEST (= LEX + THRNew) 73.03; BEST (using p1 features only) 72.08; BEST (using IMt features only) 72.11; BEST (using Mt only) 71.27; BEST (no indicator variables) 72.44
Predicting Direction of Power | We found the best setting to be using both unigrams and bigrams for all three types of n-grams, by tuning on our dev set.
Predicting Direction of Power | We also use a stronger baseline using word unigrams and bigrams as features, which obtained an accuracy of 68.6%. |
Training method | Table 2: Unigram features. |
Training method | We broadly classify features into two categories: unigram and bigram features. |
Training method | Unigram features: Table 2 shows our unigram features. |
Corpus Details | As our reference algorithm, we used the current state-of-the-art system developed by Boulis and Ostendorf (2005), which uses unigram and bigram features in an SVM framework.
Corpus Details | For each conversation side, a training example was created using unigram and bigram features with tf-idf weighting, as done in standard text classification approaches. |
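The tf-idf-weighted unigram and bigram representation described above can be sketched as follows; the helper names and the raw-tf-times-log-idf weighting are illustrative assumptions, not the exact scheme of Boulis and Ostendorf (2005):

```python
import math
from collections import Counter

def unigrams_bigrams(tokens):
    """Unigram and bigram features for one conversation side."""
    return list(tokens) + ["_".join(b) for b in zip(tokens, tokens[1:])]

def tfidf_vectors(docs):
    """tf-idf-weighted unigram+bigram vectors, one dict per document."""
    bags = [Counter(unigrams_bigrams(d)) for d in docs]
    n = len(docs)
    df = Counter(f for bag in bags for f in bag)  # document frequency
    return [{f: tf * math.log(n / df[f]) for f, tf in bag.items()}
            for bag in bags]

vecs = tfidf_vectors([["good", "movie"], ["bad", "movie"]])
```

A feature occurring in every document (here "movie") gets weight zero, while discriminative features keep positive weight.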
Corpus Details | Also, the named entity "Mike" shows up as a discriminative unigram; this may be due to the self-introductions at the beginning of the conversations and "Mike" being a common male name.
Conditional Random Fields | Using only unigram features {f_{y,x} : (y, x) ∈ Y × X} results in a model equivalent to a simple bag-of-tokens position-by-position logistic regression model.
Conditional Random Fields | The same idea can be used when the set {μ_{y,x_{t+1}} : y ∈ Y} of unigram features is sparse.
Conditional Random Fields | The features used in Nettalk experiments take the form f_{y,w} (unigram) and f_{y',y,w} (bigram), where w is an n-gram of letters.
Experimental results and discussions 6.1 Baseline experiments | Another is that BC utilizes a rich set of features to characterize a given spoken sentence, while LM is constructed solely on the basis of lexical (unigram) information.
Experimental setup 5.1 Data | They are, respectively, the ROUGE-1 (unigram) measure, the ROUGE-2 (bigram) measure and the ROUGE-L (longest common subsequence) measure (Lin, 2004).
Proposed Methods | In the LM approach, each sentence in a document can be simply regarded as a probabilistic generative model consisting of a unigram distribution (the so-called "bag-of-words" assumption) for generating the document (Chen et al., 2009).
Proposed Methods | To mitigate this potential defect, a unigram probability estimated from a general collection, which models the general distribution of words in the target language, is often used to smooth the sentence model. |
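Smoothing a sentence's unigram model with a background (general-collection) distribution is often done by linear interpolation; a minimal sketch, assuming Jelinek-Mercer-style interpolation with weight `lam` (the paper may use a different scheme):

```python
from collections import Counter

def smoothed_sentence_model(sentence, background_counts, lam=0.7):
    """Interpolate a sentence's ML unigram model with a background
    (general-collection) unigram model estimated from counts."""
    counts = Counter(sentence)
    slen = len(sentence)
    btotal = sum(background_counts.values())

    def p(w):
        # lam weights the sentence estimate; (1 - lam) the background.
        return lam * counts[w] / slen + (1 - lam) * background_counts[w] / btotal

    return p

bg = Counter({"the": 50, "cat": 5, "dog": 45})
p = smoothed_sentence_model(["the", "cat"], bg, lam=0.5)
```

Words unseen in the sentence (e.g. "dog") still get nonzero probability from the background model, which is exactly the defect the smoothing mitigates.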
Cue Discovery for Content Selection | . . . , x_m} consists of m unigram features representing the observed vocabulary used in our corpus.
Experimental Results | We use a binary unigram feature space, and we perform 7-fold cross-validation.
Prediction | One challenge of this approach is our underlying unigram feature space: tree-based algorithms are generally poor classifiers for the high-dimensionality, low-information features in a lexical feature space (Han et al., 2001).
Prediction | splits than would unigrams alone. |
Experiments | We use the Lemur toolkit (Ogilvie and Callan, 2001), version 4.11, as the basic retrieval tool, and select its default unigram LM approach based on KL-divergence with Dirichlet-prior smoothing as our basic retrieval approach.
The Language Modeling Approach to IR | The most commonly used language model in IR is the unigram model, in which terms are assumed to be independent of each other. |
The Language Modeling Approach to IR | In the rest of this paper, "language model" will refer to the unigram language model.
The Language Modeling Approach to IR | With the unigram model, the negative KL-divergence between the model θ_q of query q and the model θ_d of document d is calculated as follows:
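Since the equation itself is not reproduced here, the following sketch shows one standard way to score a document by the negative KL-divergence between a maximum-likelihood query model and a Dirichlet-smoothed document model; the function name and the `mu` default are illustrative assumptions:

```python
import math
from collections import Counter

def neg_kl_score(query, doc, collection, mu=2000):
    """Negative KL-divergence between the ML query model and a
    Dirichlet-smoothed document model (a Lemur-style setup)."""
    q, d, c = Counter(query), Counter(doc), Counter(collection)
    clen, dlen, qlen = sum(c.values()), len(doc), len(query)
    score = 0.0
    for t, n in q.items():
        pq = n / qlen                                  # p(t | theta_q)
        pd = (d[t] + mu * c[t] / clen) / (dlen + mu)   # p(t | theta_d)
        score += pq * (math.log(pd) - math.log(pq))    # -KL contribution
    return score
```

Higher (less negative) scores indicate documents whose smoothed model is closer to the query model.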
New Sense Indicators | As such, we compute unigram log probabilities (via smoothed relative frequencies) of each word under consideration in the old domain and the new domain. |
New Sense Indicators | However, we do not simply want to capture unusual words, but words that are unlikely in context, so we also need to look at the respective unigram log probabilities: ℓ^uni_old and ℓ^uni_new.
New Sense Indicators | From these four values, we compute corpus-level (and therefore type-based) statistics of the new-domain n-gram log probability (ℓ^ng_new), the difference between the n-gram probabilities in each domain (ℓ^ng_new − ℓ^ng_old), the difference between the n-gram and unigram probabilities in the new domain (ℓ^ng_new − ℓ^uni_new), and finally the combined difference: (ℓ^ng_new − ℓ^uni_new) − (ℓ^ng_old − ℓ^uni_old).
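Under the reconstruction that ℓ^ng and ℓ^uni denote n-gram and unigram log probabilities in the new and old domains (the original symbols were garbled in extraction), the four statistics can be computed as:

```python
def sense_change_indicators(lp_ng_new, lp_ng_old, lp_uni_new, lp_uni_old):
    """The four type-level statistics: new-domain n-gram log prob,
    cross-domain difference, in-domain context difference, and the
    combined difference. Symbol mapping is a reconstruction."""
    return {
        "ng_new": lp_ng_new,
        "domain_diff": lp_ng_new - lp_ng_old,
        "context_diff_new": lp_ng_new - lp_uni_new,
        "combined": (lp_ng_new - lp_uni_new) - (lp_ng_old - lp_uni_old),
    }

stats = sense_change_indicators(-2.0, -1.0, -3.0, -4.0)
```

A large negative "combined" value flags a word whose context fit dropped in the new domain relative to the old one.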
Introduction | where we explicitly distinguish the unigram feature function φ^u_k and the bigram feature function φ^b_k. Comparing the forms of the two functions, we can see that our discussion of HMMs can be extended to perceptrons by substituting Σ_k w_k φ^u_k(x, y_n) and Σ_k w_k φ^b_k(x, y_{n−1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n−1}).
Introduction | For unigram features, we compute the maximum, max_y Σ_k w_k φ^u_k(x, y), as a preprocessing step in
Introduction | In POS tagging, we used unigrams of the current word and its neighboring words, word bigrams, prefixes and suffixes of the current word, capitalization, and tag bigrams.
Why Does Unsimplified Data Help? | This is particularly important for unigrams (i.e. |
Why Does Unsimplified Data Help? | Table 3 shows the percentage of unigrams, bigrams and trigrams from the two test sets that are found in the simple and normal training data.
Why Does Unsimplified Data Help? | Even at the unigram level, the normal data contained significantly more of the test set unigrams than the simple data. |
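Such coverage numbers (the percentage of test-set n-grams found in the training data) can be computed as in this sketch, which counts n-gram tokens rather than types (an assumption; either convention is common):

```python
def ngram_coverage(test_tokens, train_tokens, n):
    """Percentage of test-set n-gram tokens that also occur in training."""
    def grams(toks):
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    train = set(grams(train_tokens))
    test = grams(test_tokens)
    return 100.0 * sum(g in train for g in test) / len(test)
```

Coverage typically drops sharply from unigrams to bigrams to trigrams, which is the effect the passage above discusses.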
Features | We also count the number of unigrams the two transcripts have in common and the length, absolute and relative, of the longest unigram overlap. |
Features | In addition, we look at the number of characters and unigrams and the audio duration of each query, with the intuition that the length of a query may be correlated with its likelihood of being retried (or a retry). |
Prediction task | T-tests between the two categories showed that all edit distance features (character, word, reduced, and phonetic; raw and normalized) are significantly more similar between retry query pairs. Similarly, the number of unigrams the two queries have in common is significantly higher for retries.
Experiments | The DISCRIMINATIVE baseline for this task is a standard maximum entropy discriminative binary classifier over unigrams.
Model | Global Distributions: At the global level, we draw several unigram distributions: a global background distribution θ_B and an attribute distribution θ_a for each attribute a.
Model | Product Level: For the ith product, we draw property unigram distributions θ_{i,1}, . . .
Experiments | We used a simple measure for isolating the syntactic likelihood of a sentence: we take the log-probability under our model and subtract the log-probability under a unigram model, then normalize by the length of the sentence. This measure, which we call the syntactic log-odds ratio (SLR), is a crude way of "subtracting out" the semantic component of the generative probability, so that sentences that use rare words are not penalized for doing so.
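A minimal sketch of the SLR computation, assuming the unigram model is a maximum-likelihood estimate from corpus counts (the exact estimator is not specified in the passage):

```python
import math
from collections import Counter

def unigram_logprob(tokens, counts):
    """Log-probability of a token sequence under an ML unigram model."""
    total = sum(counts.values())
    return sum(math.log(counts[t] / total) for t in tokens)

def slr(model_logprob, tokens, counts):
    """Syntactic log-odds ratio: model log-prob minus unigram log-prob,
    normalized by sentence length."""
    return (model_logprob - unigram_logprob(tokens, counts)) / len(tokens)
```

Because the unigram term is subtracted, a sentence full of rare words is not penalized for their rarity, only for how unlikely its structure makes them.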
Experiments | (2004) also report using a parser probability normalized by the unigram probability (but not length), and did not find it effective. |
Treelet Language Modeling | p(w|P, R, r′, w_1, w_2) to p(w|P, R, r′, w_1) and then to p(w|P, R, r′). From there, we back off to p(w|P, R), where R is the sibling immediately to the right of P, then to a raw PCFG p(w|P), and finally to a unigram distribution.
Structure-based Stacking | • Character unigrams: c_k (i − l ≤ k ≤ i + l) • Character bigrams: c_k c_{k+1} (i − l ≤ k < i + l)
Structure-based Stacking | • Character-label unigrams: c_k^{pd} (i − l_{pd} ≤ k ≤ i + l_{pd})
Structure-based Stacking | • Unigram features: C(s_k) (i − l_0 ≤ k ≤ i + l_0), T_ctb(s_k) (i − l_ctb ≤ k
Computing Feature Expectations | where h(t) is the unigram prefix of bigram t. |
Consensus Decoding Algorithms | where T_1 is the set of unigrams in the language, and δ(e, t) is an indicator function that equals 1 if t appears in e and 0 otherwise.
Consensus Decoding Algorithms | Figure 1: For the linear similarity measure U(e; e′), which computes unigram precision, the MBR translation can be found by iterating either over sentence pairs (Algorithm 1) or over features (Algorithm 2).
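A sketch of Algorithm-1-style MBR decoding with unigram precision as the similarity measure; the candidate list stands in for the n-best list, and the probabilities are assumed normalized:

```python
from collections import Counter

def unigram_precision(e, e_prime):
    """U(e; e'): clipped fraction of tokens of e that also appear in e'."""
    ce, cp = Counter(e), Counter(e_prime)
    return sum(min(n, cp[t]) for t, n in ce.items()) / len(e)

def mbr_decode(candidates, probs):
    """Iterate over sentence pairs: score each candidate by its expected
    unigram precision against the candidate distribution, keep the best."""
    def expected_sim(e):
        return sum(p * unigram_precision(e, ep)
                   for ep, p in zip(candidates, probs))
    return max(candidates, key=expected_sim)

best = mbr_decode([["a", "b"], ["a", "c"], ["a", "b"]], [0.4, 0.2, 0.4])
```

The pairwise loop is quadratic in the number of candidates, which is what iterating over features (Algorithm 2) avoids.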
Automated Approaches to Deceptive Opinion Spam Detection | Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Automated Approaches to Deceptive Opinion Spam Detection | We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+.
Experimental Setup | The mapping of sentence labels to phrase labels was unsupervised: if the phrase came from a sentence labeled (1), and there was a unigram overlap (excluding stop words) between the phrase and any of the original highlights, we marked this phrase with a positive label. |
Experimental Setup | Our feature set comprised surface features such as sentence and paragraph position information, POS tags, unigram and bigram overlap with the title, and whether high-scoring tf.idf words were present in the phrase (66 features in total). |
Experimental Setup | We report unigram overlap (ROUGE-1) as a means of assessing informativeness and the longest common subsequence (ROUGE-L) as a means of assessing fluency.
Experimental Results | As shown in Table 2a, decoding with a single variational n-gram model (VM) as per (14) improves the Viterbi baseline (except the case with a unigram VM), though often not statistically significant. |
Experimental Results | The interpolation between a VM and a word penalty feature (“wp”) improves over the unigram |
Experimental Results | This is necessarily true, but it is interesting to see that most of the improvement is obtained just by moving from a unigram to a bigram model. |
Inference | where n(t) and n(t, t′) are, respectively, unigram and bigram tag counts excluding those containing character w. Conversely, n′(t) and n′(t, t′) are, respectively, unigram and bigram tag counts only including those containing character w. The notation a^(n) denotes the ascending factorial: a(a + 1) · · · (a + n − 1).
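The ascending factorial used in these collapsed Gibbs sampling counts is straightforward to compute:

```python
def ascending_factorial(a, n):
    """a^(n) = a (a + 1) ... (a + n - 1); by convention a^(0) = 1."""
    result = 1
    for i in range(n):
        result *= a + i
    return result
```

For example, 3^(2) = 3 · 4 = 12, and the empty product gives a^(0) = 1.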
Inference | where n(w) is the unigram count of character w, and n(t′) is the unigram count of tag t′, over all character tokens (including w).
Inference | where n(j, k, t) and n(j, k, t, t′) are the numbers of languages currently assigned to cluster k which have more than j occurrences of unigram t and bigram (t, t′), respectively.
Log-Linear Models | The features used in this experiment were unigrams and bigrams of neighboring words, and unigrams, bigrams and trigrams of neighboring POS tags.
Log-Linear Models | For the features, we used unigrams of neighboring chunk tags, substrings (shorter than 10 characters) of the current word, and the shape of the word (e.g., "IL-2" is converted into "AA-#"), on top of the features used in the text chunking experiments.
Log-Linear Models | For the features, we used unigrams and bigrams of neighboring words, prefixes and suffixes of the current word, and some characteristics of the word. |
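Window-based unigram and bigram word features plus prefixes and suffixes of the current word, as described in these experiments, can be sketched as follows; the template names and window size are illustrative assumptions:

```python
def token_features(tokens, i, window=2):
    """String-valued feature templates for position i: word unigrams and
    bigrams in a window, plus prefixes/suffixes of the current word."""
    feats = []
    hi = min(len(tokens), i + window + 1)
    for k in range(max(0, i - window), hi):
        feats.append(f"w[{k - i}]={tokens[k]}")
        if k + 1 < hi:  # bigram of two adjacent in-window words
            feats.append(f"w[{k - i}]w[{k - i + 1}]={tokens[k]}_{tokens[k + 1]}")
    w = tokens[i]
    for n in (1, 2, 3):  # character prefixes/suffixes of the current word
        if len(w) >= n:
            feats.append(f"pre{n}={w[:n]}")
            feats.append(f"suf{n}={w[-n:]}")
    return feats

feats = token_features(["the", "cat", "sat"], 1, window=1)
```

Each template string becomes one binary feature in the log-linear model.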
Experiments | The vocabulary V includes all unigrams after down-casing. |
Experiments | In total, there are 16250 unique unigrams in V. |
Experiments | fication for visualization is that we consider only the top 1000 most frequent unigrams in the RST-DT training set.
Task A: Polarity Classification | We studied the influence of unigrams, bigrams, and a combination of the two, and saw that the best-performing feature set consists of the combination of unigrams and bigrams.
Task A: Polarity Classification | In this paper, we will refer from now on to n-grams as the combination of unigrams and bigrams. |
Task B: Valence Prediction | Those include n-grams (unigrams, bigrams, and the combination of the two) and LIWC scores.
Empirical Evaluation 4.1 Evaluation Setup | In the above experiments, all features (unigram + bigram) are used.
The Co-Training Approach | The English or Chinese features used in this study include both unigrams and bigrams5 and the feature weight is simply set to term frequency6. |
The Co-Training Approach | 5 For Chinese text, a unigram refers to a Chinese word and a bigram refers to two adjacent Chinese words. |
Experiments | As the backbone of our string-to-dependency system, we train 3-gram models for left and right dependencies and a unigram model for heads, using the target side of the bilingual training data.
Introduction | In this way, we hope to upgrade the unigram formulation of existing reordering models to a higher order formulation. |
Related Work | Our TNO model is closely related to the Unigram Orientation Model (UOM) (Tillman, 2004), which is the de facto reordering model of phrase-based SMT (Koehn et al., 2007). |
Multimodal Sentiment Analysis | We use a bag-of-words representation of the video transcriptions of each utterance to derive unigram counts, which are then used as linguistic features. |
Multimodal Sentiment Analysis | The remaining words represent the unigram features, which are then associated with a value corresponding to the frequency of the unigram inside each utterance transcription. |
Multimodal Sentiment Analysis | These simple weighted unigram features have been successfully used in the past to build sentiment classifiers on text, and in conjunction with Support Vector Machines (SVM) have been shown to lead to state-of-the-art performance (Maas et al., 2011). |
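The frequency-valued unigram features described above can be sketched as follows; the stop-word handling and lowercasing are illustrative assumptions about the preprocessing:

```python
from collections import Counter

def unigram_features(transcription, stopwords=frozenset()):
    """Frequency-valued unigram features for one utterance transcription,
    after lowercasing and stop-word removal."""
    tokens = [t for t in transcription.lower().split() if t not in stopwords]
    return dict(Counter(tokens))

feats = unigram_features("The movie was great great", stopwords={"the", "was"})
```

Each remaining word maps to its frequency in the utterance; these sparse vectors are then fed to an SVM, as in the setup described above.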