Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set. |
Experiments | To evaluate the advantage of explicitly accounting for the spread of the data, we conducted several experiments on two Chinese-English translation test sets, using two different feature sets in each. |
Experiments | We selected the bound step size D, based on performance on a held-out dev set, to be 0.01 for the basic feature set and 0.1 for the sparse feature set.
Experiments | 4.2 Feature Sets |
Introduction | Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and a high-dimensional (sparse) feature space (§4).
Learning in SMT | The instability of MERT in larger feature sets (Foster and Kuhn, 2009; Hopkins and May, 2011) has motivated many alternative tuning methods for SMT.
Conditional Random Fields | Based on a study of three NLP benchmarks, Tsuruoka et al. (2009) claim this approach to be much faster than the orthant-wise approach and yet to yield very comparable performance, while selecting slightly larger feature sets.
Conditional Random Fields | The n-grm feature sets (n ∈ {1, 3, 5, 7}) include all features testing embedded windows of k letters, for all 0 ≤ k ≤ n; the n-grm- setting is similar, but only includes the window of length n; in the n-grm+ setting, we add features for odd-size windows; in the n-grm++ setting, we add all sequences of letters up to size n occurring in the current window.
Conditional Random Fields | For instance, the active bigram features at position t = 2 in the sequence x = 'lemma' are as follows: the 3-grm feature set contains fywx, fyw/fl and fy/Mem; only the latter appears in the 3-grm- setting.
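The embedded letter-window scheme described in the excerpts above can be sketched as follows. This is a hypothetical illustration, not the cited paper's implementation; the function name and the `k-win=` feature-string format are invented for the example.

```python
def window_features(word, pos, n):
    """Hypothetical sketch: centered letter-window features at position
    `pos`, one per odd window size k with 1 <= k <= n (n-grm style)."""
    feats = []
    for k in range(1, n + 1, 2):         # odd window sizes 1, 3, ..., n
        half = k // 2
        lo, hi = pos - half, pos + half + 1
        if lo >= 0 and hi <= len(word):  # skip windows falling off the word
            feats.append(f"{k}-win={word[lo:hi]}")
    return feats
```

Under this sketch, the n-grm- variant would keep only the k = n entry, while n-grm keeps all embedded windows.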
Introduction | An important property of CRFs is their ability to handle large and redundant feature sets and to integrate structural dependency between output labels. |
Introduction | Limiting the feature set or the number of output labels is however frustrating for many NLP tasks, where the type and number of potentially relevant features are very large.
Introduction | Second, the experimental demonstration that using large output label sets is doable and that very large feature sets actually help improve prediction accuracy. |
Introduction | In section 4 we motivate the choice of feature sets for the automatic identification of generic NPs in context. |
Introduction | 4.2 Feature set and feature classes |
Introduction | The feature set includes NP-local and global features. |
Abstract | By adopting this ILP formulation, segmentation F-measure is increased from 0.968 to 0.974, as compared to Viterbi decoding with the same feature set.
Abstract | We adopt the basic feature set used in (Ratnaparkhi, 1996) and (Collins, 2002). |
Abstract | As introduced in Section 2.2, we adopt a very compact feature set used in (Ratnaparkhi, 1996).
Experimental Setup | We experiment with two feature sets for each language: the optimized local feature sets (denoted local), and the optimized local feature sets extended with nonlocal features (denoted nonlocal). |
Features | The feature sets are customized for each language. |
Features | The exact definitions and feature sets that we use are available as part of the download package of our system. |
Features | Nonlocal features were selected with the same greedy forward strategy as the local features, starting from the optimized local feature sets.
Introducing Nonlocal Features | In other words, it is unlikely that we can devise a feature set that is informative enough to allow the weight vector to converge towards a solution that lets the learning algorithm see the entire documents during training, at least in the situation when no external knowledge sources are used. |
Introduction | We show that for the task of coreference resolution the straightforward combination of beam search and early update (Collins and Roark, 2004) falls short of more limited feature sets that allow for exact search. |
Results | the English development set as a function of number of training iterations with two different beam sizes, 20 and 100, over the local and nonlocal feature sets.
Results | The left half uses the local feature set, and the right the extended nonlocal feature set.
Results | Local vs. Nonlocal feature sets.
Experiment Setup 4.1 Corpus | We evaluate six different feature sets for their effectiveness in AVC: SCF, DR, CO, ACO, SCF+CO, and JOANIS07.
Experiment Setup 4.1 Corpus | The other four feature sets include both syntactic and lexical information. |
Experiment Setup 4.1 Corpus | JOANIS07: We use the feature set proposed in Joanis et al.
Introduction | We develop feature sets that combine syntactic and lexical information, which are in principle useful for any Levin-style verb classification. |
Introduction | We test the general applicability and scalability of each feature set to the distinctions among 48 verb classes involving 1,300 verbs, which is, to our knowledge, the largest investigation on English verb classification by far. |
Introduction | To preview our results, a feature set that combines both syntactic information and lexical information works much better than either of them used alone. |
Machine Learning Method | We construct a semantic space with each feature set.
Machine Learning Method | Except for JOANIS07, which only contains 224 features, all the other feature sets lead to a very high-dimensional space.
Related Work | The deeper linguistic analysis allows their feature set to cover a variety of indicators of verb semantics, beyond that of frame information. |
Experiments | In order to evaluate the effectiveness of the cluster-based feature sets, we conducted dependency parsing experiments in English and Czech.
Experiments | In our English experiments, we tested eight different parsing configurations, representing all possible choices between baseline or cluster-based feature sets, first-order (Eisner, 2000) or second-order (Carreras, 2007) factorizations, and labeled or unlabeled parsing.
Experiments | Second, note that the parsers using cluster-based feature sets consistently outperform the models using the baseline features, regardless of model order or label usage. |
Feature design | The feature sets we used are similar to other feature sets in the literature (McDonald et al., 2005a; Carreras, 2007), so we will not attempt to give an exhaustive description of the features in this section.
Feature design | In our experiments, we employed two different feature sets: a baseline feature set which draws upon “normal” information sources such as word forms and parts of speech, and a cluster-based feature set that also uses information derived from the Brown cluster hierarchy. |
Feature design | Our first-order baseline feature set is similar to the feature set of McDonald et al. |
Abstract | We present a fast and scalable online method for tuning statistical machine translation models with large feature sets . |
Adaptive Online Algorithms | When we have a large feature set and therefore want to tune on a large data set, batch methods are infeasible. |
Adaptive Online MT | For example, simple indicator features like lexicalized reordering classes are potentially useful yet bloat the feature set and, in the worst case, can negatively impact
Experiments | To the dense features we add three high dimensional “sparse” feature sets.
Experiments | The primary baseline is the dense feature set tuned with MERT (Och, 2003). |
Experiments | with the PT feature set.
Experimental Setup | Instead, we use several baselines to demonstrate the usefulness of integrating multiple LCs, as well as the relative usefulness of our feature sets.
Experimental Setup | The other evaluated systems are formed by taking various subsets of our feature set.
Experimental Setup | We experiment with 4 feature sets.
Introduction | (2012) and compare our methods with analogous ones that select a fixed LC, using state-of-the-art feature sets . |
Our Proposal: A Latent LC Approach | Section 3.1 describes our general approach, Section 3.2 presents our model and Section 3.3 details the feature set.
Our Proposal: A Latent LC Approach | We choose this model for its generality, conceptual simplicity, and because it allows us to easily incorporate various feature sets and sets of latent variables.
Our Proposal: A Latent LC Approach | 3.3 Feature Set |
Discussion | In this section, we analyze the influence of the employed feature sets and constraint conditions on performance.
Discussion | Because features may interact mutually in an indirect way, even with the same feature set, different constraint conditions can have significant influences on the final performance.
Discussion | In Section 3, we introduced five candidate feature sets.
Feature Construction | 3.1 Candidate Feature Set |
Feature Construction | To sum up, among the five candidate feature sets, the position feature is used as a singleton feature.
Feature Construction | In the following experiments, focusing on Chinese relation extraction, we will analyze the performance of candidate feature sets and study the influence of the constraint conditions. |
Methodology | In this section we describe the basic components of our study: feature sets, graphical model, inference, and evaluation.
Methodology | 3.1 Input and feature sets |
Methodology | We tested several feature sets either based on, or approximating, the concept of grammatical relation described in section 2. |
Results | We evaluated SCF lexicons based on the eight feature sets described in section 3.1, as well as the VALEX SCF lexicon described in section 2.
Results | 5.1 Effect of Feature Set Choice |
Results | Table 3 illustrates the result of taking a baseline feature set (containing word as the only feature) and adding a single feature from the Simple set to it. |
Results | Feature Set / F-score / %Imp: word 43.85 / —; word+nw 43.86 / 0.0; word+na 44.78 / 2.1; word+lem 45.85 / 4.6; word+pos 45.91 / 4.7; word+nw+pos+lem+na 46.34 / 5.7
Experiments | In both experiments we observed that the performance drops when excitation polarities and trouble expressions are removed from the feature set.
Experiments | PROPOSED-*: The proposed method without the feature set denoted by “*”. |
Problem Report and Aid Message Recognizers | The feature set given to the SVMs is summarized in the top part of Table 2.
Problem Report and Aid Message Recognizers | Note that we used a common feature set for both the problem report recognizer and aid message recognizer and that it is categorized into several types: features concerning trouble expressions (TR), excitation polarity (EX), their combination (TREX1) and word sentiment polarity (WSP), features expressing morphological and syntactic structures of nuclei and their context surrounding problem/aid nuclei (MSA), features concerning semantic word classes (SWC) appearing in nuclei and their context, request phrases, such as “Please help us”, appearing in tweets (REQ), and geographical locations in tweets recognized by our location recognizer (GL).
Problem Report and Aid Message Recognizers | We also attempted to represent nucleus template IDs, noun IDs and their combinations directly in our feature set to capture typical templates fre- |
Problem-Aid Match Recognizer | Here also we attempted to capture typical or frequent matches of nuclei using template and noun IDs and their combinations, but we did not observe any improvement so we omit them from the feature set.
Problem-Aid Match Recognizer | The bottom part of Table 2 summarizes the additional feature set, some features of which are described below in more detail.
Evaluation and Discussion | The SVMs achieve a similar cross-validated performance on all feature sets containing ngrams, showing only minor improvements for individual flaws when adding non-lexical features. |
Evaluation and Discussion | Table 6 shows the performance of the SVMs with RBF kernel on each dataset using the NGRAM feature set.
Evaluation and Discussion | Classifiers using the NONGRAM feature set achieved average F1-scores below 0.50 on all datasets.
Experiments | We selected a subset of these features for our experiments and grouped them into four feature sets in order to determine how well different combinations of features perform in the task. |
Experiments | Table 4: Feature sets used in the experiments |
Experiments | We evaluate word cluster and embedding (denoted by ED) features by adding them individually as well as simultaneously into the baseline feature set . |
Experiments | This might be explained by the difference between our baseline feature set and the feature set underlying their kernel-based system. |
Feature Set | 5.1 Baseline Feature Set |
Feature Set | (2011) utilize the full feature set from (Zhou et al., 2005) plus some additional features and build the state-of-the-art feature-based RE system.
Feature Set | Unfortunately, this feature set includes the human-annotated (gold-standard) information on entity and mention types which is often missing or noisy in reality (Plank and Moschitti, 2013). |
Introduction | Recent research in this area, whether feature-based (Kambhatla, 2004; Boschee et al., 2005; Zhou et al., 2005; Grishman et al., 2005; Jiang and Zhai, 2007a; Chan and Roth, 2010; Sun et al., 2011) or kernel-based (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhang et al., 2006; Qian et al., 2008; Nguyen et al., 2009), attempts to improve the RE performance by enriching the feature sets from multiple sentence analyses and knowledge resources. |
Generative state tracking | In DIS-CDYN1, we use the original feature set, ignoring the problem described above (so that the general features contribute no information), resulting in M + K weights.
Generative state tracking | The analysis of various feature sets indicates that the ASR/SLU error correlation (confusion) features yield the largest improvement — cf.
Generative state tracking | feature set be compared to b in Table 3. |
Introduction | importance of different feature sets for this task, and measure the amount of data required to reliably train our model.
Abstract | This regularizes the model complexity and makes the tensor model highly effective in situations where a large feature set is defined but very limited resources are available for training. |
Conclusion and Future Work | This can be regarded as a form of model regularization. Therefore, compared with the traditional vector-space models, learning in the tensor space is very effective when a large feature set is defined, but only a small amount of training data is available.
Introduction | This also makes training the model parameters a challenging problem, since the amount of labeled training data is usually small compared to the size of feature sets: the feature weights cannot be estimated reliably.
Introduction | Such models require learning individual feature weights directly, so that the number of parameters to be estimated is identical to the size of the feature set.
Tensor Model Construction | ways of mapping, which is an intractable number of possibilities even for modest-sized feature sets, making it impractical to carry out a brute-force search.
Tensor Space Representation | In general, if V features are defined for a learning problem, and we (i) organize the feature set as a tensor Φ ∈ ℝ^(n1×n2×···×nD) and (ii) use H component rank-1 tensors to approximate the corresponding target weight tensor.
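The parameter saving behind this tensor representation can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the dimensions, seed, and factor names are invented, and the reconstruction `W[i][j][k] ≈ Σ_h u[h][i]·v[h][j]·w[h][k]` is the standard rank-1 (CP-style) sum the excerpt alludes to.

```python
from random import Random

# Sketch: view V = n1*n2*n3 feature weights as a 3-way tensor and
# approximate it by H rank-1 components, cutting the parameter count
# from n1*n2*n3 down to H * (n1 + n2 + n3).
dims = (4, 5, 5)   # n1, n2, n3  ->  V = 100 features
H = 2              # number of rank-1 components

rnd = Random(0)
u = [[rnd.random() for _ in range(dims[0])] for _ in range(H)]
v = [[rnd.random() for _ in range(dims[1])] for _ in range(H)]
w = [[rnd.random() for _ in range(dims[2])] for _ in range(H)]

def weight(i, j, k):
    """Weight-tensor entry reconstructed from the rank-1 factors."""
    return sum(u[h][i] * v[h][j] * w[h][k] for h in range(H))

n_params_tensor = H * sum(dims)                # 2 * (4+5+5) = 28
n_params_vector = dims[0] * dims[1] * dims[2]  # 100
```

With these invented dimensions, the factored form needs 28 parameters where a flat vector-space model needs 100, which is the regularization effect the surrounding excerpts describe.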
Tensor Space Representation | Specifically, a vector space model assumes each feature weight to be a “free” parameter, and estimating them reliably could therefore be hard when training data are not sufficient or the feature set is huge. |
Experiments | Our primary feature set IGC consists of 127 template unigrams that emphasize coarse properties (i.e., properties 7, 9, and 11 in Table 1). |
Experiments | We compare against the language-specific feature sets detailed in the literature on high-resource top-performing SRL systems: From Björkelund et al.
Experiments | (2009), these are feature sets for German, English, Spanish and Chinese, obtained by weeks of forward selection (Bdeflmemh); and from Zhao et al. |
Discussion and future work | [Figure 3 residue: accuracy axis ticks from 0.90 to 1.00, with a legend distinguishing the LIWC and Syntactic feature sets.]
Discussion and future work | Figure 3: Effect of feature set choice on cross-validation accuracy. |
Discussion and future work | 2012; Almela et al., 2012; Fornaciari and Poesio, 2012), our results suggest that the set of syntactic features presented here performs significantly better than the LIWC feature set on our data, and across seven out of the eight experiments based on age groups and verbosity of transcriptions.
Related Work | Descriptions of the data (section 3) and feature sets (section 4) precede experimental results (section 5) and the concluding discussion (section 6). |
Abstract | With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. |
Discussion | In future work, we would like to investigate more sophisticated features, better learners, and in general improve the components of our system that have been neglected in the current investigation of relative improvements by scaling the size of data and feature sets.
Experiments | The results on the news-commentary (nc) data show that training on the development set does not benefit from adding large feature sets — BLEU result differences between tuning 12 default features |
Experiments | Here tuning large feature sets on the respective dev sets yields significant improvements of around 2 BLEU points over tuning the 12 default features on the dev sets. |
Introduction | Our resulting models are learned on large data sets, but they are small and outperform models that tune feature sets of various sizes on small development sets. |
Related Work | All approaches have been shown to scale to large feature sets and all include some kind of regularization method. |
Approach | To answer the second research objective we will analyze the contribution of the proposed feature set to this function. |
Approach | For completeness we also include in the feature set the value of the tf-idf similarity measure.
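A tf-idf similarity value like the one mentioned above can be computed as a single real-valued feature roughly as follows. This is a minimal sketch under assumed conventions (raw term frequency, idf = log(N/df), cosine similarity); the function name and weighting variant are not taken from the cited work.

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Sketch: cosine similarity of two token lists under tf-idf
    weighting (raw tf, idf = log(N / df)), usable as one extra
    real-valued feature alongside the rest of the feature set."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))              # document frequency per word
    def vec(doc):
        tf = Counter(doc)
        return {w: tf[w] * math.log(n / df[w]) for w in tf if w in df}
    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```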
Experiments | Feature Set / MRR / P@1
Experiments | The algorithm incrementally adds to the feature set the feature that provides the highest MRR improvement in the development partition. |
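The incremental procedure described above is standard greedy forward feature selection; a hedged sketch follows. The function name is invented, and `evaluate` stands in for whatever dev-set score is used (e.g. MRR on the development partition).

```python
def greedy_forward_select(candidates, evaluate):
    """Sketch of greedy forward feature selection: repeatedly add
    whichever candidate most improves a held-out score; `evaluate`
    is an abstract callable mapping a feature set to a score."""
    selected = []
    best = evaluate(frozenset())
    remaining = set(candidates)
    while remaining:
        gains = {f: evaluate(frozenset(selected + [f])) for f in remaining}
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best:        # no candidate improves the score
            break
        selected.append(f_best)
        best = gains[f_best]
        remaining.remove(f_best)
    return selected, best
```

Stopping as soon as no candidate improves the score is one common convention; other variants run for a fixed budget or allow small tolerated drops.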
Related Work | This approach allowed us to perform a systematic feature analysis on a large-scale real-world corpus and a comprehensive feature set . |
Related Work | Our model uses a larger feature set that includes correlation and transformation-based features and five different content representations. |
Abstract | Experiments on applying SSWE to a benchmark Twitter sentiment classification dataset in SemEval 2013 show that (1) the SSWE feature performs comparably with handcrafted features in the top-performing system; (2) the performance is further improved by concatenating SSWE with the existing feature set.
Introduction | After concatenating the SSWE feature with the existing feature set, we push the state-of-the-art to 86.58% in macro-F1.
Related Work | NRC-ngram refers to the feature set of NRC leaving out ngram features. |
Related Work | After concatenating SSWEu with the feature set of NRC, the performance is further improved to 86.58%.
Related Work | The concatenated features SSWEu +NRC-ngram (86.48%) outperform the original feature set of NRC (84.73%). |
Acquisition of Hyponymy Relations from Wikipedia | (2008) but LF1–LF5 and SF1–SF9 are the same as their feature set.
Acquisition of Hyponymy Relations from Wikipedia | Let us provide an overview of the feature sets used in Sumida et al. |
Acquisition of Hyponymy Relations from Wikipedia | These are the feature sets used in Sumida et al. |
Motivation | Since the learning settings (feature sets, feature values, training data, corpora, and so on) are usually different in two languages, the reliable part in one language may be overlapped by an unreliable part in another language.
Experiment Two | We derive two types of feature sets from the responses: features derived from each user model and features derived from attributes of the query/ response pair itself. |
Experiment Two | The five feature sets for the user model are: |
Experiment Two | allUtility: 12 features consisting of the high, low, and average utility scores from the previous three feature sets.
Conclusion | Using this feature set, we obtain an accuracy of 73.0% on a blind test.
Introduction | This best-performing system uses our new feature set . |
Predicting Direction of Power | We use another feature set LEX to capture word ngrams, POS (part of speech) ngrams and mixed ngrams. |
Predicting Direction of Power | We also performed an ablation study to understand the importance of different slices of our feature sets.
Structural Analysis | THRPR: This feature set includes two metadata-based feature sets: positional and verbosity.
Experiments | Second, note that the parsers incorporating the N-gram feature sets consistently outperform the models using the baseline features in all test data sets, regardless of model order or label usage. |
Web-Derived Selectional Preference Features | In this paper, we employ two different feature sets: a baseline feature set which draws upon “normal” information sources, such as word forms and parts of speech (POS), without including the web-derived selectional preference features, and a feature set that conjoins the baseline features and the web-derived selectional preference features.
Web-Derived Selectional Preference Features | These feature sets are similar to other feature sets in the literature (McDonald et al., 2005; Carreras, 2007), so we will not attempt to give an exhaustive description.
Web-Derived Selectional Preference Features | For example, the baseline feature set includes indicators for word-to-word and tag-to-tag interactions between the head and modifier of a dependency. |
Abstract | Previous work in traditional text classification and its variants — such as sentiment analysis — has achieved successful results by using the bag-of-words representation; that is, by treating text as a collection of words with no interdependencies, training a classifier on a large feature set of word unigrams which appear in the corpus. |
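The bag-of-words representation described in this excerpt can be shown concretely. A minimal sketch, with an invented function name, vocabulary, and whitespace tokenization; real systems would use a learned vocabulary and a proper tokenizer.

```python
from collections import Counter

def bow_features(text, vocab):
    """Bag-of-words sketch: a text becomes a vector of unigram counts
    over a fixed vocabulary, discarding all word order and dependency
    information."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]
```

For example, `bow_features("Good movie good plot", ["good", "bad", "movie"])` yields `[2, 0, 1]`: a classifier trained on such vectors sees only which words occur and how often.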
Abstract | To illustrate, consider the following feature set, a bigram and a trigram (each term in the n-gram either has the form word or Atag):
Abstract | While the feature set was too small to produce notable results, we identified which features actually were indicative of lect. |
Conclusion | We have conducted exhaustive evaluation with multiple machine learning classifiers and different feature sets spanning from lexical information to psychological categories developed by Tausczik and Pennebaker (2010).
Task A: Polarity Classification | We studied the influence of unigrams, bigrams and a combination of the two, and saw that the best performing feature set consists of the combination of unigrams and bigrams. |
Task A: Polarity Classification | For each information source (metaphor, context, source, target and their combinations), we built a separate n-gram feature set and model, which was evaluated on 10-fold cross validation. |
Task A: Polarity Classification | We have used different feature sets and information sources to solve the task. |
Task B: Valence Prediction | We have studied different feature sets and information sources to solve the task. |
Experiments | Table 4: Overall F1 (%) of NER and Accuracy (%) of NEN with different feature sets.
Experiments | Table 4 shows the overall performance of our method with various feature set combinations, where Fo, Fl and Fg denote the orthographic features, the lexical features, and the gazetteer-related features, respectively.
Our Method | (2) The model uses two feature sets, referred to below as feature set one and feature set two.
Our Method | 4.3.1 Feature Set One
Our Method | 4.3.2 Feature Set Two
Results and discussion | For all four other questions, the best feature set is Continuity, which is a combination of summarization specific features, coreference features and cosine similarity of adjacent sentences. |
Results and discussion | Feature set Gram. |
Challenges for Discriminative SMT | This problem of over-fitting is exacerbated in discriminative models with large, expressive feature sets.
Challenges for Discriminative SMT | Learning with a large feature set requires many training examples and typically many iterations of a solver during training. |
Evaluation | To do this we use our own implementation of Hiero (Chiang, 2007), with the same grammar but with the traditional generative feature set trained in a linear model with minimum BLEU training. |
Evaluation | The feature set includes: a trigram language model (lm) trained |
Evaluation | The relative scores confirm that our model, with its minimalist feature set, achieves comparable performance to the standard feature set without the language model. |
Conclusion | Finally, further efforts to engineer a grammar suitable for realization from the CCGbank should provide richer feature sets, which, as our feature ablation study suggests, are useful for boosting hypertagging performance, hence for finding better and more complete realizations.
Results and Discussion | The whole feature set was found in feature ablation testing on the development set to outperform all other feature subsets significantly (p < 2.2 × 10^-16).
Results and Discussion | The full feature set outperforms all others significantly (p < 2.2 × 10^-16).
Results and Discussion | The results for the full feature set on Sections 00 and 23 are outlined in Table 2.
Experiments | Experimental results are given in Table 2, where we also provide the number of features in each feature set . |
Experiments | Figure 1: ROC curves for classifiers trained using different feature sets (English SVO and AN test sets). |
Experiments | According to ROC plots in Figure 1, all three feature sets are effective, both for SVO and for AN tasks. |
Related Work | Current work builds on this study, and incorporates new syntactic relations as metaphor candidates, adds several new feature sets and different, more reliable datasets for evaluating results. |
Syllabification Experiments | In this section, we will discuss the results of our best emission feature set (five-gram features with a context window of eleven letters) on held-out unseen test sets. |
Syllabification with Structured SVMs | With SVM-HMM, the crux of the task is to create a tag scheme and feature set that produce good results. |
Syllabification with Structured SVMs | After experimenting with the development set, we decided to include in our feature set a window of eleven characters around the focus character, five on either side. |
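An eleven-character window with five characters on either side of the focus can be sketched as below. The function name and the `#` boundary-padding symbol are assumptions for illustration, not details from the cited system.

```python
def char_window(word, i, size=11, pad="#"):
    """Sketch of an eleven-character window around focus position i
    (five characters on either side), padded at word boundaries so
    every position yields a window of exactly `size` characters."""
    half = size // 2
    padded = pad * half + word + pad * half
    return padded[i : i + size]
```

Each window position would then be expanded into emission features (unigrams through five-grams, per the surrounding excerpts) for the structured SVM.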
Syllabification with Structured SVMs | As is apparent from Figure 2, we see a substantial improvement by adding bigrams to our feature set.
Future Work | We are also interested to see how well this feature set performs on speech data, as in (Aoki et al., 2003). |
Related Work | They motivate a richer feature set, which, however, does not yet appear to be implemented.
Related Work | (2005) adds word repetition to their feature set.
Related Work | Our feature set incorporates information which has proven useful in meeting segmentation (Galley et al., 2003) and the task of detecting addressees of a specific utterance in a meeting (Jovanovic et al., 2006).
Maximum Entropy Based Model for Hindi NER | In Table 2 we have shown the accuracy values for a few feature sets.
Maximum Entropy Based Model for Hindi NER | Again, when w_{i-2} and w_{i+2} are removed from the feature set (i.e.
Maximum Entropy Based Model for Hindi NER | When suffix, prefix and digit information are added to the feature set, the f-value is increased up to 74.26.
Causal Relations for Why-QA | We used the three types of feature sets in Table 3 for training the CRFs, where j is in the range i − 4 ≤ j ≤ i + 4 for current position i in a causal relation candidate.
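A context range of i − 4 ≤ j ≤ i + 4 corresponds to feature templates over a nine-position window, which can be sketched as follows. The function name and the `tok[offset]=value` feature format are invented for illustration; the actual templates in the cited work also cover other attribute types.

```python
def context_templates(tokens, i, width=4):
    """Sketch of position-offset feature templates: one feature per
    token at offset j - i, for every j with i - width <= j <= i + width
    that falls inside the sequence."""
    return [
        f"tok[{j - i}]={tokens[j]}"
        for j in range(max(0, i - width), min(len(tokens), i + width + 1))
    ]
```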
Causal Relations for Why-QA | More detailed information concerning the configurations of all the nouns in all the candidates of an appropriate causal relation (including their cause parts) and the question is encoded into our features ef1–ef4 in Table 4, and the final judgment is done by our re-ranker.
Experiments | We evaluated the performance when we removed one of the three types of features (ALL-“MORPH”, ALL-“SYNTACTIC” and ALL-“C-MARKER”) and compared the results in these settings with the one when all the feature sets were used (ALL). |
Experiments | We confirmed that all the feature sets improved the performance, and we got the best performance when using all of them. |
Experiment | To compare our joint inference versus other learning models, we also employed a decision tree (DT) learner, equipped with the same feature set as our FCRF. |
Experiment | Both models take the whole feature set described in Section 2.3. |
Experiment | 3.4.3 Feature set evaluation |
Empirical Evaluation: Simile-derived Representations | Suspecting that a noisy feature set had contributed to the apparent drop in performance, these authors then proceed to apply a variety of noise filters to reduce the set of feature values to 51,345, which in turn leads to an improved cluster purity measure of 62.7%. |
Empirical Evaluation: Simile-derived Representations | In experiment 2, we see a similar ratio of feature quantities before filtering; after some initial filtering, Almuhareb and Poesio reduce their feature set to just under 10 times the size of the simile-derived feature set.
Empirical Evaluation: Simile-derived Representations | First, the feature representations do not need to be hand-filtered and noise-free to be effective; we see from the above results that the raw values extracted from the simile pattern prove slightly more effective than filtered feature sets used by Almuhareb and Poesio. |
Related Work | As noted by the latter authors, this results in a much smaller yet more diagnostic feature set for each concept. |
Distant Supervision | However, we did not find a cumulative effect (line 8) of the two feature sets . |
Features | We refer to these feature sets as CoreLex (CX) and VerbNet (VN) features and to their combination as semantic features (SEM). |
Features | feature set is referred to as named entities (NE). |
Features | We refer to this feature set as sequential features (SQ). |
Experimental Comparison with Unsupervised Learning | With this feature set, the CRF model is less expressive than DMV.
Experimental Comparison with Unsupervised Learning | The CRF cannot consider valency even with the full feature set, but this is balanced by the ability to use distance.
Experimental Comparison with Unsupervised Learning | First we note that GE training using the full feature set substantially outperforms the restricted feature set, despite the fact that the same set of constraints is used for both experiments.
Conclusion and Outlook | Conceptualizing MT evaluation as an entailment problem motivates the use of a rich feature set that covers, unlike almost all earlier metrics, a wide range of linguistic levels, including lexical, syntactic, and compositional phenomena.
Expt. 2: Predicting Pairwise Preferences | Table header: Feature set | Consistency (%) | System-level correlation (ρ)
Introduction | (2005)), and thus predict the quality of MT hypotheses with a rich RTE feature set.
Regression-based MT Quality Prediction | (2007) train binary classifiers on a feature set formed by a number of MT metrics. |
Introduction | Second, although the feature set is fundamentally a combination of those used in previous works (Zhang and Clark, 2010; Huang and Sagae, 2010), to integrate them in a single incremental framework is not straightforward. |
Model | The feature set of our model is fundamentally a combination of the features used in the state-of-the-art joint segmentation and POS tagging model (Zhang and Clark, 2010) and dependency parser (Huang and Sagae, 2010), both of which are used as baseline models in our experiment. |
Model | All of the models described above except Dep’ are based on the same feature sets for segmentation and |
Related Works | Zhang and Clark (2008) proposed an incremental joint segmentation and POS tagging model, with an effective feature set for Chinese. |
Clustering Methods, Evaluation Metrics and Experimental Setup | Table 1: Sample output for a cluster produced with the grid-scf-sem feature set and the IGNGF clustering method. |
Features and Data | Table 4(a) includes the evaluation results for all the feature sets when using IGNGF clustering. |
Features and Data | In terms of features, the best results are obtained using the grid-scf-sem feature set with an F-measure of 0.70. |
Features and Data | In contrast, the classification obtained using the scf-synt-sem feature set has a higher CMP for the clustering with optimal mPUR (0.57), but a lower F-measure (0.61) and a larger number of classes (16)
Approach | 3.2 Feature set |
Approach | Our full feature set is as follows: |
Conclusions and future work | The addition of an incoherence metric to the feature set of an AA system has been shown to improve performance significantly (Miltsakaki and Kukich, 2000; Miltsakaki and Kukich, 2004). |
Validity tests | Although the above modifications do not exhaust the potential challenges a deployed AA system might face, they represent a threat to the validity of our system since we are using a highly related feature set.
Arabic Word Segmentation Model | This feature set also allows the model to take into account other interactions between the beginning and end of a word, particularly those involving the definite article ال (al-).
Arabic Word Segmentation Model | A notable property of this feature set is that it remains highly dialect-agnostic, even though our additional features were chosen in response to errors made on text in Egyptian dialect. |
Error Analysis | • errors that can be fixed with a fuller analysis of just the problematic token, and therefore represent a deficiency in the feature set; and
Error Analysis | In 36 of the 100 sampled errors, we conjecture that the presence of the error indicates a shortcoming of the feature set, resulting in segmentations that make sense locally but are not plausible given the full token.
Machine learning-based cache model | Therefore, the intra-sentential and inter-sentential zero-anaphora resolution models are separately trained by exploiting different feature sets as shown in Table 2. |
Machine learning-based cache model | Table 1: Feature set used in the cache models |
Machine learning-based cache model | The feature set used in the cache model is shown in Table 1. The ‘CASE_MARKER’ feature roughly captures the salience of the local transition dealt with in Centering Theory, and is also intended to capture the global foci of a text coupled with the BEGINNING feature.
Related Work | But, importantly, our classifiers all use the same feature set so they do not represent independent views of the data. |
Related Work | The feature set for these classifiers is exactly the same as described in Section 3.2, except that we add a new lexical feature that represents the head noun of the target NP (i.e., the NP that needs to be tagged). |
Related Work | ³But technically this is not co-training because our feature sets are all the same.
Empirical Analysis | Results on the Boundary Detection (BD) task are obtained by training an SVM model on the same feature set presented in (Johansson and Nugues, 2008b) and are slightly below the state-of-the-art BD accuracy reported in (Coppola et al., 2009).
Empirical Analysis | Given the relatively simple feature set adopted here, this result is very significant, particularly in terms of the resulting efficiency.
Introduction | Notice how this is also a general problem of statistical learning processes, as large fine-grained feature sets are more exposed to the risk of overfitting.
Related Work | While these approaches increase the expressive power of the models to capture more general linguistic properties, they rely on complex feature sets, are more demanding about the amount of training information and increase the overall exposure to overfitting effects.
Dependency Parsing: Baseline | With the notation defined in Table 1, the feature set shown in Table 2 is adopted.
Dependency Parsing: Baseline | We used a large scale feature selection approach as in (Zhao et al., 2009) to obtain the feature set in Table 2. |
Evaluation Results | The results with different feature sets are in Table 4. |
Evaluation Results | Table 4: The results with different feature sets. Columns: features, with p, without p.
Dependency Parsing with HPSG | Therefore, we extend this feature set by adding four more feature categories, which are similar to the original ones but with the dependency relation replaced by the dependency backbone of the HPSG outputs.
Dependency Parsing with HPSG | The extended feature set is shown in Table 1. |
Dependency Parsing with HPSG | The extended feature set is shown in Table 2 (the new features are listed separately). |
Evaluation framework | 4.2 Performance indicator and feature set |
Evaluation framework | As our focus is on the algorithmic aspect, in all experiments we use the same feature set, which consists of the seventeen features proposed in (Specia et al., 2009).
Evaluation framework | This feature set, fully described in (Callison-Burch et al., 2012), takes into account the complexity of the source sentence (e.g. number of tokens, number of translations per source word) and the fluency of the target translation (e.g. language model probabilities).
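Evaluation framework | The two feature families described above can be sketched as follows; only three of the seventeen features are illustrated, and the helper name and input placeholders (the translation-count lookup and the precomputed LM log-probability) are illustrative assumptions, not the paper's implementation:

```python
def qe_features(source_tokens, target_tokens, translations_per_word, target_lm_logprob):
    """Sketch of two QE feature families: source-complexity features
    (token count, average translations per source word) and a
    target-fluency feature (length-normalized LM log-probability).
    `translations_per_word` and `target_lm_logprob` stand in for the
    lexicon and language-model lookups a real system would provide."""
    n_src = len(source_tokens)
    # Source complexity: how ambiguous is each source word on average?
    avg_trans = sum(translations_per_word.get(w, 1) for w in source_tokens) / n_src
    # Target fluency: LM log-probability normalized by target length.
    norm_lm = target_lm_logprob / max(len(target_tokens), 1)
    return {
        "src_num_tokens": n_src,
        "src_avg_translations": avg_trans,
        "tgt_lm_logprob_per_token": norm_lm,
    }

feats = qe_features(["le", "chat"], ["the", "cat"], {"le": 3, "chat": 2}, -8.0)
```

A regression model for quality prediction would then be trained on vectors of such features.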
Annotations | Table 2 shows the performance of our feature set in grammars with several different levels of structural annotation.³ Klein and Manning (2003) find large gains (6% absolute improvement, 20% relative improvement) going from v = 0, h = 0 to v = 1, h = 1; however, we do not find the same level of benefit.
Features | Table 1 shows the results of incrementally building up our feature set on the Penn Treebank development set. |
Introduction | Our parser can be easily adapted to this task by replacing the X-bar grammar over treebank symbols with a grammar over the sentiment values to encode the output variables and then adding n-gram indicators to our feature set to capture the bulk of the lexical effects. |
Copula Models for Text Regression | This nice property essentially allows us to fuse distinctive lexical, syntactic, and semantic feature sets naturally into a single compact model. |
Experiments | Feature sets: |
Experiments | To do this, we sample an equal number of features from each feature set, and concatenate
Abstract | Third, to enhance the power of parsing models, we enlarge the feature set with nonlocal features and semi-supervised word cluster features. |
Experiment | We built three new parsing systems based on the StateAlign system: the Nonlocal system extends the feature set of the StateAlign system with nonlocal features, the Cluster system extends the feature set with semi-supervised word cluster features, and the Nonlocal & Cluster system extends the feature set with both groups of features.
Related Work | Finally, we enhanced our parsing model by enlarging the feature set with nonlocal features and semi-supervised word cluster features. |
CRF and features | The work describes a feature set proposed for this task, which includes word forms in a local window, values of grammatical class, gender, number and case, tests for agreement on number, gender and case, as well as simple tests for letter case. |
CRF and features | We took this feature set as a starting point. |
CRF and features | The final feature set includes the following |
Empirical Evaluation | To compare classification performance, we use two feature sets: (i) standard word + POS 1-4 grams and (ii) AD-expressions from §5.
Empirical Evaluation | Predicting an agreeing arguing nature is harder than predicting a disagreeing one, across all feature settings.
Empirical Evaluation | Using the discovered AD-expressions (Table 6, last row) as features renders a statistically significant (see Table 6 caption) improvement over other baseline feature settings.
Features | The principal feature sets are listed in Table 2, together with an indication whether they are novel or have been used in previous work. |
Speaker Identification | Table 2: Principal feature sets.
Speaker Identification | subsequently we add three more feature sets that represent the following neighboring utterances: n - 2, n - 1 and n + 1. Informally, the features of the utterances n - 1 and n + 1 encode the first observation, while the features representing the utterance n - 2 encode the second observation.
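Speaker Identification | The windowed representation described above (features of utterance n combined with offset-prefixed copies of the features of utterances n - 2, n - 1 and n + 1) can be sketched as follows; the per-utterance features and the naming scheme are illustrative assumptions:

```python
def utterance_features(utterance):
    """Toy per-utterance features (length and first token), standing in
    for whatever features a real speaker-identification model would use."""
    tokens = utterance.split()
    return {"len": len(tokens), "first": tokens[0] if tokens else ""}

def windowed_features(utterances, n):
    """Combine the features of utterance n with those of utterances
    n-2, n-1 and n+1, prefixing each neighbor's features with its
    relative offset so they remain distinct feature dimensions."""
    feats = {f"0:{k}": v for k, v in utterance_features(utterances[n]).items()}
    for offset in (-2, -1, +1):
        i = n + offset
        if 0 <= i < len(utterances):  # skip neighbors outside the dialogue
            for k, v in utterance_features(utterances[i]).items():
                feats[f"{offset:+d}:{k}"] = v
    return feats
```

For example, `windowed_features(["hello there", "how are you", "fine thanks", "good to hear"], 2)` yields features such as `"-1:len"` and `"+1:first"` alongside the current utterance's own `"0:..."` features.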
Related Work | (2007) used a maximum entropy classifier trained on a feature set that includes the use of gazetteers and a stop-word list, appearance of a NE in the training set, leading and trailing word bigrams, and the tag of the previous word. |
Related Work | (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF |
Related Work | Abdul-Hamid and Darwish (2010) used a simplified feature set that relied primarily on character level features, namely leading and trailing letters in a word. |
Experiments | Models labeled X/Y use learning algorithm X and feature set Y. |
Experiments | The feature set DP+ contains TF-IDF, DP alignment, dictionary, and length features.
Experiments | The results on the test fold are shown in Figure 1, which compares the learning algorithms, and Figure 2, which compares feature sets.
Beyond lexical CLTE | builds on two additional feature sets, derived from i) semantic phrase tables, and ii) dependency relations.
Experiments and results | (a) In both settings all the feature sets used outperform the approaches taken as terms of comparison. |
Experiments and results | As shown in Table 1, the combined feature set (PT+SPT+DR) significantly⁵ outperforms the lexical model (64.5% vs. 62.6%), while SPT and DR features separately added to PT (PT+SPT, and PT+DR) lead to marginal improvements over the results achieved by the PT model alone (about 1%).
Experiments | us to compare the model-fitting capacity of different feature sets from another perspective, especially when the training data is not sufficiently well fitted by the model. |
Method | We refine Hernault et al.’s original feature set by incorporating our own features as well as some adapted from Lin et al. |
Method | (2009) also incorporated contextual features in their feature set.
Decoding | With the rich feature set in Table 1, the running time of Intersect is longer than the time of Rescoring.
Experiments | Table 2 shows the feature settings of the systems, where MST1/2 refers to the basic first-/second-order parser and MSTB1/2 refers to the enhanced first-/second-order parser.
Experiments | MSTB1 and MSTB2 used the same feature setting, but used different order models.
Automated Approaches to Deceptive Opinion Spam Detection | Specifically, we consider the following three n-gram feature sets, with the corresponding features lowercased and unstemmed: UNIGRAMS, BIGRAMS+, TRIGRAMS+, where the superscript + indicates that the feature set subsumes the preceding feature set.
Automated Approaches to Deceptive Opinion Spam Detection | We consider all three n-gram feature sets, namely UNIGRAMS, BIGRAMS+, and TRIGRAMS+, with corresponding language models smoothed using the interpolated Kneser-Ney method (Chen and Goodman, 1996).
Automated Approaches to Deceptive Opinion Spam Detection | We use SVMlight (Joachims, 1999) to train our linear SVM models on all three approaches and feature sets described above, namely POS, LIWC, UNIGRAMS, BIGRAMS+, and TRIGRAMS+. |
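Automated Approaches to Deceptive Opinion Spam Detection | The subsumption relation among these n-gram feature sets (BIGRAMS+ containing all unigrams and bigrams, TRIGRAMS+ additionally containing all trigrams) can be sketched as follows; the helper name and example sentence are illustrative, not from the paper:

```python
from collections import Counter

def ngram_features(tokens, max_n):
    """Collect all n-grams of order 1..max_n, so that the feature set
    for max_n subsumes the feature set for every smaller order."""
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

tokens = "the room was very clean".lower().split()
unigrams = ngram_features(tokens, 1)       # UNIGRAMS
bigrams_plus = ngram_features(tokens, 2)   # BIGRAMS+ subsumes UNIGRAMS
trigrams_plus = ngram_features(tokens, 3)  # TRIGRAMS+ subsumes BIGRAMS+

assert set(unigrams) <= set(bigrams_plus) <= set(trigrams_plus)
```

These feature dictionaries could then be vectorized and fed to an SVM or used to estimate the smoothed language models mentioned above.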
Learning Algorithm | The held-out portion is used to tune the feature set and results are reported for the test split only, i.e., using unseen instances. |
Learning Algorithm | We improve BASIC with an extended feature set which targets especially A1 and the verb (Table 5). |
Negation in Natural Language | The main contributions are: (1) interpretation of negation using focus detection; (2) focus of negation annotation over all PropBank negated sentences¹; (3) feature set to detect the focus of negation; and (4) model to semantically represent negation and reveal its underlying positive meaning.
Discussion | The held-out results in Figure 2 suggest that the combination of syntactic and lexical features provides better performance than either feature set on its own. |
Evaluation | At most recall levels, the combination of syntactic and lexical features offers a substantial improvement in precision over either of these feature sets on its own. |
Evaluation | No feature set strongly outperforms any of the others across all relations. |
Conclusions | Future research directions include developing rich feature sets and using corpus level or external information. |
Experiments | Since different feature sets, NLP tools, etc. are used in different benchmarked systems, we are also interested in comparing the proposed algorithm with different soft relational clustering variants.
Experiments | With the same feature sets and distance function, KARC-S outperforms FRC in F score by about 5%. |
Corpus Details | However, stopwords were retained in the feature set as various sociolinguistic studies have shown that the use of some stopwords, for instance pronouns and determiners, is correlated with age and gender.
Corpus Details | Also, only the ngrams with frequency greater than 5 were retained in the feature set following Boulis and Ostendorf (2005). |
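Corpus Details | The filtering step described above (dropping n-grams with frequency of 5 or less while deliberately retaining stopwords) might look like the following sketch; the threshold follows the text, but the stopword list and function name are illustrative:

```python
from collections import Counter

# Illustrative stopword list; sociolinguistic cues such as pronouns
# and determiners are kept on purpose, so stopwords are NOT removed.
STOPWORDS = {"i", "you", "the", "a", "it", "my"}

def build_feature_set(ngram_counts, min_freq=5):
    """Keep only n-grams occurring more than `min_freq` times.
    Stopwords pass through this filter like any other token."""
    return {ng for ng, c in ngram_counts.items() if c > min_freq}

counts = Counter({"i": 40, "my": 12, "awesome": 6, "rare_word": 2})
features = build_feature_set(counts)
assert "rare_word" not in features            # below the frequency threshold
assert "i" in features and "my" in features   # stopwords retained
```

Note the strict inequality: an n-gram seen exactly 5 times is discarded, matching "frequency greater than 5".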
Related Work | Another relevant line of work has been on the blog domain, using a bag of words feature set to discriminate age and gender (Schler et al., 2006; Burger and Henderson, 2006; Nowson and Oberlander, 2006). |
Dependency parsing for machine translation | The three feature sets that were used in our experiments are shown in Table 2. |
Dependency parsing for machine translation | It is quite similar to the McDonald (2005a) feature set, except that it does not include the set of all POS tags that appear between each candidate head-modifier pair (i, j).
Dependency parsing for machine translation | The primary difference between our feature sets and the ones of McDonald et al.
Experiments | Our feature set is summarized in Table 2, which closely follows Charniak and Johnson (2005), except that we excluded the nonlocal features Edges, NGram, and CoPar, and simplified the Rule and NGramTree features, since they were too complicated to compute.⁴ We also added four unlexicalized local features from Collins (2000) to cope with data sparsity.
Experiments | tures in the updated version.⁵ However, our initial experiments show that, even with this much simpler feature set, our 50-best reranker performed equally well as theirs (both with an F-score of 91.4, see Tables 3 and 4).
Experiments | This result confirms that our feature set design is appropriate, and the averaged perceptron learner is a reasonable candidate for reranking. |
Experimental Setup | Table 1: Performance of EDITDIST and our model with various feature sets on EN-ES-W. See section 5.
Experimental Setup | We will use MCCA (for matching CCA) to denote our model using the optimal feature set (see section 5.3). |
Introduction | As an example of the performance of the system, in English-Spanish induction with our best feature set , using corpora derived from topically similar but nonparallel sources, the system obtains 89.0% precision at 33% recall. |