Background | These generally consist of a projection layer that maps words, sub-word units or n-grams to high-dimensional embeddings; the latter are then combined component-wise with an operation such as summation.
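A minimal sketch of such a projection-plus-summation layer (dimensions, token ids, and random weights are illustrative assumptions, not from the paper):

```python
import numpy as np

# Projection layer: each word / sub-word unit / n-gram id maps to an embedding row.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embeddings = rng.normal(size=(vocab_size, dim))

def embed_sum(token_ids):
    # Combine the looked-up embeddings component-wise by summation.
    return embeddings[token_ids].sum(axis=0)

sentence = [3, 1, 7]  # toy token ids
vec = embed_sum(sentence)
assert vec.shape == (dim,)
```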
Background | The trained weights in the filter m correspond to a linguistic feature detector that learns to recognise a specific class of n-grams.
Background | These n-grams have size n ≤ m, where m is the width of the filter.
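As a toy illustration (filter weights and input values are invented), a narrow 1-D convolution with a filter of width m produces one activation per window of m consecutive positions, so a trained filter can respond to any n-gram with n ≤ m:

```python
# Filter of width m; trained weights would act as an n-gram feature detector.
m = 3
filt = [1.0, -1.0, 0.5]
sequence = [0.2, 1.0, -0.3, 0.7, 0.1]  # toy per-word feature values

# One activation per window of m consecutive positions (narrow convolution).
activations = [sum(w * x for w, x in zip(filt, sequence[i:i + m]))
               for i in range(len(sequence) - m + 1)]
assert len(activations) == len(sequence) - m + 1
```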
Experiments | The ability to extract features based on long n-grams and to hierarchically combine these features is highly beneficial.
Experiments | In the first layer, the sequence is a continuous n-gram from the input sentence; in higher layers, sequences can be made of multiple separate n-grams.
Experiments | The feature detectors learn to recognise not just single n-grams, but patterns within n-grams that have syntactic, semantic or structural significance. |
Introduction | Since individual sentences are rarely observed or not observed at all, one must represent a sentence in terms of features that depend on the words and short n-grams in the sentence that are frequently observed. |
Introduction | by which the features of the sentence are extracted from the features of the words or n-grams.
Properties of the Sentence Model | The filters m of the wide convolution in the first layer can learn to recognise specific n-grams that have size less than or equal to the filter width m; as we see in the experiments, m in the first layer is often set to a relatively large value
Properties of the Sentence Model | The subsequence of n-grams extracted by the generalised pooling operation induces invariance to absolute positions, but maintains their order and relative positions.
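One concrete instance of such a pooling operation is k-max pooling, sketched below (an assumption about the operation intended; the paper's exact variant may differ): it keeps the k largest activations in their original order, so absolute positions are discarded but relative order survives.

```python
def kmax_pool(values, k):
    # Pick the positions of the k largest values, then restore their order.
    top = sorted(sorted(range(len(values)), key=lambda i: values[i])[-k:])
    return [values[i] for i in top]

# 0.9 (position 1) and 0.8 (position 3) survive, still in their original order.
assert kmax_pool([0.1, 0.9, 0.2, 0.8, 0.3], k=2) == [0.9, 0.8]
```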
Properties of the Sentence Model | This gives the RNN excellent performance at language modelling, but it is suboptimal for remembering at once the n-grams further back in the input sentence. |
Abstract | Our model incorporates heterogeneous relational evidence about both hypernymy and siblinghood, captured by semantic features based on patterns and statistics from Web n-grams and Wikipedia abstracts. |
Analysis | Here, our Web n-grams dataset (which only contains frequent n-grams) and Wikipedia abstracts do not suffice and we would need to add richer Web data for such world knowledge to be reflected in the features.
Experiments | Feature sources: The n-gram semantic features are extracted from the Google n-grams corpus (Brants and Franz, 2006), a large collection of English n-grams (for n = 1 to 5) and their frequencies computed from almost 1 trillion tokens (95 billion sentences) of Web text. |
Experiments | For this, we use a hash-trie on term pairs (similar to that of Bansal and Klein (2011)), and scan once through the n-gram (or abstract) set, skipping many n-grams (or abstracts) based on fast checks of missing unigrams, exceeding length, suffix mismatches, etc.
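The single-pass filtering can be sketched as follows (a simplified stand-in for the hash-trie; the term pairs, length limit, and checks shown are toy assumptions):

```python
# Index the term pairs by their unigrams, then scan the n-gram stream once,
# skipping any n-gram that fails a cheap check before doing expensive matching.
term_pairs = [("dog", "animal"), ("oak", "tree")]
needed_unigrams = {w for pair in term_pairs for w in pair}

def scan(ngrams, max_len=5):
    hits = []
    for ng in ngrams:
        toks = ng.split()
        if len(toks) > max_len:
            continue                      # fast check: exceeding length
        if not needed_unigrams & set(toks):
            continue                      # fast check: missing unigrams
        hits.append(ng)                   # only now do the expensive match
    return hits

assert scan(["a dog is an animal", "red car", "tall oak tree"]) == \
    ["a dog is an animal", "tall oak tree"]
```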
Features | The Web n-grams corpus has broad coverage but is limited to up to 5-grams, so it may not contain pattern-based evidence for various longer multi-word terms and pairs.
Features | Similar to the Web n-grams case, we also fire Wikipedia-based pattern order features. |
Introduction | The belief propagation approach allows us to efficiently and effectively incorporate heterogeneous relational evidence via hypernymy and siblinghood (e.g., coordination) cues, which we capture by semantic features based on simple surface patterns and statistics from Web n-grams and Wikipedia abstracts. |
Related Work | All the patterns and counts for our Web and Wikipedia edge and sibling features described above are extracted after stemming the words in the terms, the n-grams, and the abstracts (using the Porter stemmer).
Experimental Evaluation | CNG is a profile-based method which represents the author as the N most frequent character n-grams of all his/her training texts.
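A minimal sketch of such a profile (parameter values and the input text are illustrative):

```python
from collections import Counter

def profile(texts, n=3, N=5):
    # Count all character n-grams across the author's texts, keep the top N.
    counts = Counter()
    for t in texts:
        counts.update(t[i:i + n] for i in range(len(t) - n + 1))
    return [g for g, _ in counts.most_common(N)]

p = profile(["the cat sat on the mat"], n=3, N=3)
assert len(p) == 3 and all(len(g) == 3 for g in p)
```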
Proposed Tri-Training Algorithm | The features in the character view are the character n-grams of a document. |
Proposed Tri-Training Algorithm | Character n-grams are simple and easily available for any natural language. |
Proposed Tri-Training Algorithm | We use four content-independent structures including n-grams of POS tags (n = 1..3) and rewrite rules (Kim et al., 2011). |
Related Work | Example features include function words (Argamon et al., 2007), richness features (Gamon 2004), punctuation frequencies (Graham et al., 2005), character (Grieve, 2007), word (Burrows, 1992) and POS n-grams (Gamon, 2004; Hirst and Feiguina, 2007), rewrite rules (Halteren et al., 1996), and similarities (Qian and Liu, 2013). |
Our Approach | The features we use are character n-grams around mismatches. |
Our Approach | i) n-grams around gaps, i.e., we account only for insertions and deletions; |
Our Approach | ii) n-grams around any type of mismatch, i.e., we account for all three types of mismatches. |
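The two variants can be sketched as follows (function names, window size, and the toy alignment are assumptions; '-' marks a gap in a pairwise character alignment):

```python
def mismatch_ngrams(aligned_a, aligned_b, n=3, gaps_only=False):
    # Collect character n-grams in a small window around each mismatch.
    feats = []
    for i, (a, b) in enumerate(zip(aligned_a, aligned_b)):
        if a == b:
            continue
        if gaps_only and a != "-" and b != "-":
            continue  # variant (i): keep only insertions and deletions
        lo, hi = max(0, i - 1), i + n - 1
        feats.append((aligned_a[lo:hi], aligned_b[lo:hi]))
    return feats

# 'color' vs 'colour': one insertion gap, so one feature pair is extracted.
assert mismatch_ngrams("colo-r", "colour", gaps_only=True) == [("o-r", "our")]
```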
Abstract | Contrasting the predictive ability of statistics derived from 6 different corpora, we find intuitive results showing that, e.g., a British corpus over-predicts the speed with which an American will react to the words ward and duke, and that the Google n-grams over-predicts familiarity with technology terms. |
Fitting Behavioral Data 2.1 Data | 2Surprisingly, fife was determined to be one of the words with the largest frequency asymmetry between Switchboard and the Google n-grams corpus. |
Introduction | Specifically, we predict human data from three widely used psycholinguistic experimental paradigms—lexical decision, word naming, and picture naming—using unigram frequency estimates from Google n-grams (Brants and Franz, 2006), Switchboard (Godfrey et al., 1992), spoken and written English portions of CELEX (Baayen et al., 1995), and spoken and written portions of the British National Corpus (BNC Consortium, 2007). |
Introduction | For example, Google n-grams overestimates the ease with which humans will process words related to the web (tech, code, search, site), while the Switchboard corpus—a collection of informal telephone conversations between strangers—overestimates how quickly humans will react to colloquialisms (heck, dam) and backchannels (wow, right). |
Generation & Propagation | For the unlabeled phrases, the set of possible target translations could be extremely large (e.g., all target language n-grams).
Generation & Propagation | A naïve way to achieve this goal would be to extract all n-grams, from n = 1 to a maximum n-gram order, from the monolingual data, but this strategy would lead to a combinatorial explosion in the number of target phrases.
Generation & Propagation | This set of candidate phrases is filtered to include only n-grams occurring in the target monolingual corpus, and helps to prune passed-through OOV words and invalid translations. |
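A toy sketch of this filtering step (the corpus and candidate phrases are invented): candidates are kept only if they occur as n-grams in the target monolingual corpus.

```python
def extract_ngrams(tokens, max_n):
    # All n-grams from n = 1 up to max_n, as space-joined strings.
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

mono = "the big house is big".split()
corpus_ngrams = extract_ngrams(mono, max_n=2)

candidates = ["big house", "house big", "the big", "blue house"]
kept = [c for c in candidates if c in corpus_ngrams]
assert kept == ["big house", "the big"]
```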
Introduction | Unlike previous work (Irvine and Callison-Burch, 2013a; Razmara et al., 2013), we use higher order n-grams instead of restricting to unigrams, since our approach goes beyond OOV mitigation and can enrich the entire translation model by using evidence from monolingual text. |
Experiments | Following evaluations in machine translation as well as previous work in sentence compression (Unno et al., 2006; Clarke and Lapata, 2008; Martins and Smith, 2009; Napoles et al., 2011b; Thadani and McKeown, 2013), we evaluate system performance using F1 metrics over n-grams and dependency edges produced by parsing system output with RASP (Briscoe et al., 2006) and the Stanford parser. |
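The n-gram part of this evaluation can be sketched as follows (a simplified F1 over bigram multisets; the dependency-edge F1 computed from parser output is omitted):

```python
from collections import Counter

def ngram_f1(sys_toks, ref_toks, n=2):
    # Multiset overlap of n-grams between system output and reference.
    sys_c = Counter(tuple(sys_toks[i:i + n]) for i in range(len(sys_toks) - n + 1))
    ref_c = Counter(tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1))
    overlap = sum((sys_c & ref_c).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(sys_c.values())
    r = overlap / sum(ref_c.values())
    return 2 * p * r / (p + r)

f1 = ngram_f1("the cat sat".split(), "the cat sat down".split(), n=2)
assert abs(f1 - 0.8) < 1e-9
```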
Experiments | We report results over the following systems grouped into three categories of models: tokens + n-grams, tokens + dependencies, and joint models.
Introduction | Joint methods have also been proposed that invoke integer linear programming (ILP) formulations to simultaneously consider multiple structural inference problems—both over n-grams and input dependencies (Martins and Smith, 2009) or n-grams and all possible dependencies (Thadani and McKeown, 2013). |
Multi-Structure Sentence Compression | C. In addition, we define bigram indicator variables y_ij ∈ {0, 1} to represent whether a particular order-preserving bigram ⟨t_i, t_j⟩ from S is present as a contiguous bigram in C, as well as dependency indicator variables z_ij ∈ {0, 1} corresponding to whether the dependency arc t_i → t_j is present in the dependency parse of C. The score for a given compression C can now be defined to factor over its tokens, n-grams and dependencies as follows.
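A toy sketch of such a factored score (weights and indicator settings are invented): each token, order-preserving bigram, and dependency arc contributes its weight only when its 0/1 indicator variable is switched on.

```python
def score(x, y, z, tok_w, bigram_w, dep_w):
    # Sum per-token, per-bigram, and per-arc scores gated by indicators.
    s = sum(tok_w[i] * x[i] for i in x)
    s += sum(bigram_w[ij] * y[ij] for ij in y)
    s += sum(dep_w[ij] * z[ij] for ij in z)
    return s

x = {0: 1, 1: 1, 2: 0}        # token indicators
y = {(0, 1): 1, (1, 2): 0}    # order-preserving bigram indicators
z = {(0, 1): 1}               # dependency-arc indicators
total = score(x, y, z,
              tok_w={0: 0.5, 1: 0.2, 2: 0.9},
              bigram_w={(0, 1): 1.0, (1, 2): 0.3},
              dep_w={(0, 1): 0.4})
assert abs(total - 2.1) < 1e-9
```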
Introduction | Recent approaches to this task have been based on slot-filling (Yang et al., 2011; Elliott and Keller, 2013), combining web-scale n-grams (Li et al., 2011), syntactic tree substitution (Mitchell et al., 2012), and description-by-retrieval (Farhadi et al., 2010; Ordonez et al., 2011; Hodosh et al., 2013). |
Methodology | p_n measures the effective overlap by calculating the ratio of the maximum number of n-grams co-occurring between a candidate and a reference to the total number of n-grams in the candidate text.
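A minimal sketch of this modified n-gram precision (clipping against reference counts as in BLEU; the candidate and reference here are toy inputs):

```python
from collections import Counter

def p_n(cand, refs, n):
    # Candidate n-gram counts, clipped by the maximum reference count per n-gram.
    cand_c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    max_ref = Counter()
    for ref in refs:
        ref_c = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        max_ref |= ref_c  # element-wise maximum over references
    clipped = sum(min(c, max_ref[g]) for g, c in cand_c.items())
    return clipped / sum(cand_c.values())

# "the the cat" vs reference "the cat": the second "the" is clipped away.
p1 = p_n("the the cat".split(), ["the cat".split()], n=1)
assert abs(p1 - 2 / 3) < 1e-9
```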
Methodology | (2012); to the best of our knowledge, the only image description work to use higher-order n-grams with BLEU is Elliott and Keller (2013). |
Experiments | 4.4 Analysis of the Effects of Different N-grams
Experiments | To evaluate the effects of different n-grams for our proposed transfer model, we compared the uni-/bi-/tri-gram transfer models in SMT, and illustrate the results in Fig- |
Experiments | Translation quality of the transfer model with different n-gram orders.
Introduction | Yet, though many language models more sophisticated than N-grams have been proposed, N-grams are empirically hard to beat in terms of WER.
Motivation | Given the rise of unsupervised latent topic modeling with Latent Dirichlet Allocation (Blei et al., 2003) and similar latent variable approaches for discovering meaningful word co-occurrence patterns in large text corpora, we ought to be able to leverage these topic contexts instead of merely N-grams.
Motivation | information retrieval, and again, interpolate latent topic models with N-grams to improve retrieval performance. |
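Such interpolation is typically a simple linear mixture of the two probability estimates, sketched here with invented probabilities and mixing weight:

```python
def interpolate(p_ngram, p_topic, lam=0.7):
    # Linear interpolation: lam weights the N-gram estimate,
    # (1 - lam) weights the latent-topic estimate.
    return lam * p_ngram + (1 - lam) * p_topic

p = interpolate(p_ngram=0.02, p_topic=0.08, lam=0.7)
assert abs(p - 0.038) < 1e-12
```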