Ad hoc rule detection | 3.4 Bigram anomalies
Ad hoc rule detection | 3.4.1 Motivation
Ad hoc rule detection | The bigram method examines relationships between adjacent sisters, complementing the whole rule method by focusing on local properties. |
Ad hoc rule detection | But only the final elements have anomalous bigrams: HD:ID IR:IR, IR:IR AN:RO, and AN:RO JR:IR all never occur.
Additional information | This rule is entirely correct, yet the XX:XX position has low whole rule and bigram scores.
Approach | First, the bigram method abstracts a rule to its bigrams.
Evaluation | For example, the bigram method with a threshold of 39 leads to finding 283 errors (455 × .622).
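As a sketch of the abstraction step described above: a rule's daughter sequence is reduced to its adjacent-sister bigrams, which are then checked against an inventory of attested bigrams. All labels and names below are illustrative, not the paper's code.

```python
from collections import Counter

def rule_bigrams(rule):
    """Abstract a rule (sequence of daughter labels) to adjacent-sister bigrams."""
    return list(zip(rule, rule[1:]))

def anomalous_bigrams(rule, attested):
    """Return the bigrams of `rule` never seen in the attested inventory."""
    return [bg for bg in rule_bigrams(rule) if bg not in attested]

# Toy inventory of attested bigrams, built from a tiny set of rules.
treebank_rules = [["HD:ID", "IR:IR"], ["AN:RO", "JR:IR"]]
attested = Counter(bg for r in treebank_rules for bg in rule_bigrams(r))

anoms = anomalous_bigrams(["HD:ID", "IR:IR", "AN:RO"], attested)
```

A rule is flagged exactly when one of its local bigrams is unattested, even if each label is individually common.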
Evaluation | The whole rule and bigram methods reveal greater precision in identifying problematic dependencies, isolating elements with lower UAS and LAS scores than with frequency, along with corresponding greater pre- |
Introduction and Motivation | We propose to flag erroneous parse rules, using information which reflects different grammatical properties: POS lookup, bigram information, and full rule comparisons. |
Introduction | Most methods have employed some variant of Expectation Maximization (EM) to learn parameters for a bigram |
Introduction | Ravi and Knight (2009) achieved the best results thus far (92.3% word token accuracy) via a Minimum Description Length approach using an integer program (IP) that finds a minimal bigram grammar that obeys the tag dictionary constraints and covers the observed data. |
Minimized models for supertagging | The 1241 distinct supertags in the tagset result in 1.5 million tag bigram entries in the model and the dictionary contains almost 3.5 million word/tag pairs that are relevant to the test data. |
Minimized models for supertagging | The set of 45 P08 tags for the same data yields 2025 tag bigrams and 8910 dictionary entries. |
Minimized models for supertagging | Our objective is to find the smallest supertag grammar (of tag bigram types) that explains the entire text while obeying the lexicon’s constraints. |
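A minimal sketch of this objective, substituting greedy set cover for the integer program the work actually uses (all names are ours): select tag bigrams until every adjacent word pair can be explained by at least one selected bigram whose tags the lexicon allows.

```python
from itertools import product

def greedy_bigram_grammar(sentences, lexicon):
    """Greedy cover: every adjacent word pair must be explainable by some
    selected tag bigram whose tags the lexicon allows for those words."""
    items = [(s, i) for s, sent in enumerate(sentences)
             for i in range(len(sent) - 1)]
    coverage = {}  # tag bigram -> set of word-pair positions it can explain
    for s, i in items:
        w1, w2 = sentences[s][i], sentences[s][i + 1]
        for bg in product(lexicon[w1], lexicon[w2]):
            coverage.setdefault(bg, set()).add((s, i))
    grammar, uncovered = set(), set(items)
    while uncovered:
        best = max(coverage, key=lambda bg: len(coverage[bg] & uncovered))
        grammar.add(best)
        uncovered -= coverage[best]
    return grammar

lexicon = {"the": {"DT"}, "can": {"MD", "NN"}, "rusted": {"VBD"}}
grammar = greedy_bigram_grammar([["the", "can", "rusted"]], lexicon)
```

Greedy cover only approximates minimality; the point is that the search space is over tag bigram types, constrained by the tag dictionary.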
Experiments | We follow prior work and use sets of bigrams within words. |
Experiments | In our case, during bipartite matching the set X is the set of bigrams in the language being re-permuted, and Y is the union of bigrams in the other languages. |
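A toy version of this setup, using greedy one-to-one matching by edit distance as a stand-in for true bipartite matching (bigrams are represented as two-character strings; names are illustrative):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def greedy_match(X, Y):
    """Greedily pair each bigram in X with its closest unused bigram in Y."""
    pairs, free = [], set(Y)
    for x in X:
        y = min(free, key=lambda cand: edit_distance(x, cand))
        pairs.append((x, y))
        free.discard(y)
    return pairs

pairs = greedy_match(["ab", "cd"], ["ab", "ce", "zz"])
```

An optimal bipartite matching (e.g. the Hungarian algorithm) can improve on this greedy pairing when matches conflict.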
Experiments | Besides the heuristic baseline, we tried our model-based approach using Unigrams, Bigrams and Anchored Unigrams, with and without learning the parametric edit distances. |
Message Approximation | Figure 2: Various topologies for approximating messages: (a) a unigram model, (b) a bigram model, (c) the anchored unigram model, and (d) the n-best plus backoff model used in Dreyer and Eisner (2009).
Message Approximation | The first is a plain unigram model, the second is a bigram model, and the third is an anchored unigram topology: a position-specific unigram model for each position up to some maximum length. |
Message Approximation | The second topology we consider is the bigram topology, illustrated in Figure 2(b). |
Simulation 1 | Our reader’s language model was an unsmoothed bigram model created using a vocabulary set con- |
Simulation 1 | From this vocabulary, we constructed a bigram model using the counts from every bigram in the BNC for which both words were in vocabulary (about 222,000 bigrams).
Simulation 1 | Specifically, we constructed the model’s initial belief state (i.e., the distribution over sentences given by its language model) by directly translating the bigram model into a wFSA in the log semiring. |
Simulation 2 | Instead, we begin with the same set of bigrams used in Sim. 1, i.e., those that contain two in-vocabulary words, and trim this set by removing rare bigrams that occur fewer than 200 times in the BNC (except that we do not trim any bigrams that occur in our test corpus).
Simulation 2 | This reduces our set of bigrams to about 19,000. |
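The construction and trimming steps above can be sketched as follows (corpus, counts, and threshold are toy stand-ins for the BNC figures):

```python
from collections import Counter

def bigram_counts(tokens, vocab):
    """Count bigrams in which both words are in-vocabulary."""
    return Counter((w1, w2) for w1, w2 in zip(tokens, tokens[1:])
                   if w1 in vocab and w2 in vocab)

def trim(counts, min_count, keep):
    """Drop rare bigrams unless they are in the must-keep set
    (here, the bigrams occurring in the test corpus)."""
    return {bg: c for bg, c in counts.items()
            if c >= min_count or bg in keep}

corpus = ["a", "b", "a", "b", "c", "a", "b"]
counts = bigram_counts(corpus, vocab={"a", "b", "c"})
model = trim(counts, min_count=2, keep={("b", "c")})
```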
Conditional Random Fields | In the sequel, we distinguish between two types of feature functions: unigram features f_{y,x}, associated with parameters μ_{y,x}, and bigram features f_{y',y,x}, associated with parameters λ_{y',y,x}.
Conditional Random Fields | On the other hand, bigram features {f_{y',y,x}}_{(y',y,x) ∈ Y² × X} are helpful in modelling dependencies between successive labels.
Conditional Random Fields | Assume the set of bigram features {λ_{y',y,x_{t+1}}}_{(y',y) ∈ Y²} is sparse with only r(x_{t+1}) ≪ |Y|² non-null values and define the |Y| × |Y| sparse matrix
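A minimal sketch of how such sparsity can be exploited: only the few non-null λ entries are stored explicitly, and the |Y| × |Y| transition matrix is represented implicitly (assumed names, not the paper's implementation):

```python
import math

def sparse_transition_lookup(lambdas, default=0.0):
    """lambdas: {(y_prev, y): weight}; only the few non-null bigram
    parameters are stored, the rest share a single default value."""
    M = {k: math.exp(v) for k, v in lambdas.items()}
    base = math.exp(default)
    return lambda y_prev, y: M.get((y_prev, y), base)

trans = sparse_transition_lookup({("B", "I"): 1.0})
```

Forward-backward recursions can then touch only the r(x) stored entries plus one shared default, instead of all |Y|² cells.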
Experiments | • Bigrams: an implementation of the bigram classifier for soft pattern matching proposed by Cui et al.
Experiments | The probability is calculated as a mixture of bigram and |
Experiments | WCL-1          99.88  42.09  59.22  76.06
Experiments | WCL-3          98.81  60.74  75.23  83.48
Experiments | Star patterns  86.74  66.14  75.05  81.84
Experiments | Bigrams        66.70  82.70  73.84  75.80
Introduction | where we explicitly distinguish the unigram feature function φ^u and the bigram feature function φ^b. Comparing the form of the two functions, we can see that our discussion on HMMs can be extended to perceptrons by substituting Σ_k w^u_k φ^u_k(x, y_n) and Σ_k w^b_k φ^b_k(x, y_{n-1}, y_n) for log p(x_n|y_n) and log p(y_n|y_{n-1}).
Introduction | For bigram features, we compute their upper bounds offline.
Introduction | The simplest case is that the bigram features are independent of the token sequence x.
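In that token-independent case, the offline upper bound is simply a maximum over previous labels, as in this sketch (weights and names invented for illustration):

```python
def bigram_upper_bounds(weights, labels):
    """weights: {(y_prev, y): score}. For each label y, the best possible
    bigram contribution is the max over y_prev; absent pairs score 0."""
    return {y: max(weights.get((yp, y), 0.0) for yp in labels)
            for y in labels}

weights = {("A", "A"): 0.5, ("B", "A"): 2.0, ("A", "B"): -1.0}
bounds = bigram_upper_bounds(weights, labels=["A", "B"])
```

Such bounds let a decoder prune or defer transitions without scoring every (y_prev, y) pair at every position.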
Experiments and Discussions | We use R-1 (recall against unigrams), R-2 (recall against bigrams), and R-SU4 (recall against skip-4 bigrams).
Experiments and Discussions | Note that R-2 is a measure of bigram recall and sumHLDA of HybHSum2 is built on unigrams rather than bigrams.
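Bigram recall in the style of R-2 can be sketched as follows (a simplified single-reference version, not the official ROUGE implementation):

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def bigram_recall(candidate, reference):
    """Fraction of the reference's bigrams (with multiplicity) that also
    appear in the candidate."""
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum(min(c, cand[bg]) for bg, c in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

score = bigram_recall("the cat sat on the mat".split(), "the cat sat".split())
```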
Regression Model | We similarly include bigram features in the experiments. |
Regression Model | We also include bigram extensions of DMF features. |
Regression Model | We use sentence bigram frequency, sentence rank in a document, and sentence size as additional fea- |
Clustering-based word representations | The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992). |
Clustering-based word representations | So it is a class-based bigram language model. |
Clustering-based word representations | One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context. |
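The quantity Brown clustering maximizes can be illustrated on a toy corpus: the average mutual information of adjacent class bigrams under a given word-to-class map. This sketch computes only that objective; the agglomerative merging itself is omitted.

```python
import math
from collections import Counter

def class_bigram_mi(tokens, word2class):
    """Average mutual information of adjacent class bigrams."""
    classes = [word2class[w] for w in tokens]
    uni = Counter(classes)
    bi = Counter(zip(classes, classes[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    mi = 0.0
    for (c1, c2), c in bi.items():
        p12 = c / n_bi
        mi += p12 * math.log(p12 / ((uni[c1] / n_uni) * (uni[c2] / n_uni)))
    return mi

tokens = ["a", "b", "a", "b", "a", "b"]
mi_split = class_bigram_mi(tokens, {"a": "X", "b": "Y"})   # informative split
mi_merged = class_bigram_mi(tokens, {"a": "C", "b": "C"})  # everything merged
```

Collapsing all words into one class drives the objective to zero, which is why the algorithm prefers clusterings whose class bigrams remain predictive.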
Experiments | Excluding q_{t-1} = q_t bigrams (leading to 0.32M frames from 2.39M frames in “all”) offers a glimpse of expected performance differences were duration modeling to be included in the models.
Limitations and Desiderata | To produce Figures 1 and 2, a small fraction of probability mass was reserved for unseen bigram transitions (as opposed to backing off to unigram probabilities). |
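One simple way to reserve such mass, sketched with an illustrative reserve fraction (not necessarily the authors' exact scheme): a fixed fraction of each context's probability is spread uniformly over its unseen successors.

```python
from collections import Counter

def smoothed_bigram_probs(tokens, vocab, reserve=0.01):
    """Reserve a small fraction of each context's mass, spread uniformly
    over unseen successors in the vocabulary (no unigram backoff)."""
    counts = Counter(zip(tokens, tokens[1:]))
    ctx = Counter(tokens[:-1])
    def prob(w1, w2):
        if counts[(w1, w2)]:
            return (1 - reserve) * counts[(w1, w2)] / ctx[w1]
        unseen = sum(1 for w in vocab if not counts[(w1, w)])
        return reserve / unseen if unseen else 0.0
    return prob

p = smoothed_bigram_probs(["a", "b", "a", "b"], vocab={"a", "b", "c"})
```

Each context's distribution still sums to one: seen transitions share 1 − reserve, unseen ones share the reserve.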
The Extended-Degree-of-Overlap Model | The EDO model mitigates R-specificity because it models each bigram (q_{t-1}, q_t) = (S_i, S_j) as the modified bigram (m, [o_{ij}, n_j]), involving three scalars, each of which is a sum, a commutative (and therefore rotation-invariant) operation.
Automatic Metaphor Recognition | They use the hyponymy relation in WordNet and word bigram counts to predict metaphors at a sentence level.
Automatic Metaphor Recognition | Hereby they calculate bigram probabilities of verb-noun and adjective-noun pairs (including the hyponyms/hypernyms of the noun in question). |
Automatic Metaphor Recognition | However, by using bigram counts over verb-noun pairs, Krishnakumaran and Zhu (2007) lose a great deal of information compared to a system extracting verb-object relations from parsed text.
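The underlying bigram-count idea can be sketched as flagging verb-noun pairs with low conditional probability (threshold and names are illustrative; the hypernym/hyponym expansion is omitted):

```python
from collections import Counter

def flag_candidates(pairs, threshold=0.2):
    """Flag verb-noun pairs whose estimated P(noun | verb) is low."""
    pair_counts = Counter(pairs)
    verb_counts = Counter(v for v, _ in pairs)
    return [(v, n) for (v, n), c in pair_counts.items()
            if c / verb_counts[v] < threshold]

pairs = [("devour", "food")] * 9 + [("devour", "book")]
candidates = flag_candidates(pairs)
```

The rarely attested pair stands out relative to the verb's usual objects, which is the signal the bigram-count approach relies on.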