Abstract | We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
Experiments | To evaluate the advantage of explicitly accounting for the spread of the data, we conducted several experiments on two Chinese-English translation test sets, using two different feature sets in each. |
Experiments | We selected the bound step size D, based on performance on a held-out dev set, to be 0.01 for the basic feature set and 0.1 for the sparse feature set.
Experiments | 4.2 Feature Sets |
Introduction | Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and a high-dimensional (sparse) feature space (§4).
Learning in SMT | The instability of MERT with larger feature sets (Foster and Kuhn, 2009; Hopkins and May, 2011) has motivated many alternative tuning methods for SMT.
Abstract | We present a fast and scalable online method for tuning statistical machine translation models with large feature sets.
Adaptive Online Algorithms | When we have a large feature set and therefore want to tune on a large data set, batch methods are infeasible. |
Adaptive Online MT | For example, simple indicator features like lexicalized reordering classes are potentially useful yet bloat the feature set and, in the worst case, can negatively impact
Experiments | To the dense features we add three high-dimensional “sparse” feature sets.
Experiments | The primary baseline is the dense feature set tuned with MERT (Och, 2003). |
Experiments | with the PT feature set.
Experiments | In both experiments we observed that the performance drops when excitation polarities and trouble expressions are removed from the feature set . |
Experiments | PROPOSED-*: The proposed method without the feature set denoted by “*”. |
Problem Report and Aid Message Recognizers | The feature set given to the SVMs is summarized in the top part of Table 2.
Problem Report and Aid Message Recognizers | Note that we used a common feature set for both the problem report recognizer and aid message recognizer and that it is categorized into several types: features concerning trouble expressions (TR), excitation polarity (EX), their combination (TREX) and word sentiment polarity (WSP), features expressing morphological and syntactic structures of nuclei and their context surrounding problem/aid nuclei (MSA), features concerning semantic word classes (SWC) appearing in nuclei and their context, request phrases, such as “Please help us”, appearing in tweets (REQ), and geographical locations in tweets recognized by our location recognizer (GL).
Problem Report and Aid Message Recognizers | We also attempted to represent nucleus template IDs, noun IDs and their combinations directly in our feature set to capture typical templates fre- |
Problem-Aid Match Recognizer | Here also we attempted to capture typical or frequent matches of nuclei using template and noun IDs and their combinations, but we did not observe any improvement so we omit them from the feature set.
Problem-Aid Match Recognizer | The bottom part of Table 2 summarizes the additional feature set, some of whose features are described below in more detail.
Evaluation and Discussion | The SVMs achieve a similar cross-validated performance on all feature sets containing ngrams, showing only minor improvements for individual flaws when adding non-lexical features. |
Evaluation and Discussion | Table 6 shows the performance of the SVMs with RBF kernel on each dataset using the NGRAM feature set.
Evaluation and Discussion | Classifiers using the NONGRAM feature set achieved average F1-scores below 0.50 on all datasets.
Experiments | We selected a subset of these features for our experiments and grouped them into four feature sets in order to determine how well different combinations of features perform in the task. |
Experiments | Table 4: Feature sets used in the experiments |
Generative state tracking | In DIS-CDYN1, we use the original feature set, ignoring the problem described above (so that the general features contribute no information), resulting in M + K weights.
Generative state tracking | The analysis of various feature sets indicates that the ASR/SLU error correlation (confusion) features yield the largest improvement — cf. feature set b̂ compared to b in Table 3.
Introduction | importance of different feature sets for this task, and measure the amount of data required to reliably train our model.
Discussion and future work | Figure 3: Effect of feature set choice on cross-validation accuracy. |
Discussion and future work | 2012; Almela et al., 2012; Fornaciari and Poesio, 2012), our results suggest that the set of syntactic features presented here performs significantly better than the LIWC feature set on our data, and across seven out of the eight experiments based on age groups and verbosity of transcriptions.
Related Work | Descriptions of the data (section 3) and feature sets (section 4) precede experimental results (section 5) and the concluding discussion (section 6). |
Conclusion | We have conducted an exhaustive evaluation with multiple machine learning classifiers and different feature sets spanning from lexical information to the psychological categories developed by Tausczik and Pennebaker (2010).
Task A: Polarity Classification | We studied the influence of unigrams, bigrams and a combination of the two, and saw that the best performing feature set consists of the combination of unigrams and bigrams. |
Task A: Polarity Classification | For each information source (metaphor, context, source, target and their combinations), we built a separate n-gram feature set and model, which was evaluated on 10-fold cross validation. |
Task A: Polarity Classification | We have used different feature sets and information sources to solve the task. |
Task B: Valence Prediction | We have studied different feature sets and information sources to solve the task. |
Causal Relations for Why-QA | We used the three types of feature sets in Table 3 for training the CRFs, where j ranges over i − 4 ≤ j ≤ i + 4 for the current position i in a causal relation candidate.
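As a rough sketch of the window-based extraction this sentence describes (the token fields and feature names below are illustrative, not the paper's actual Table 3 templates), features at position i can be collected from every j with i − 4 ≤ j ≤ i + 4:

```python
def window_features(tokens, i, window=4):
    """Collect CRF features for position i from tokens in [i-4, i+4].

    `tokens` is a list of (surface, pos) pairs; the feature templates
    here are hypothetical stand-ins for the paper's Table 3 set.
    """
    feats = {}
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        surface, pos = tokens[j]
        offset = j - i  # relative position, so the model can distinguish neighbors
        feats[f"word[{offset}]={surface}"] = 1.0
        feats[f"pos[{offset}]={pos}"] = 1.0
    return feats
```

Clipping the range at the sequence boundaries keeps the extractor safe near the start and end of a candidate.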
Causal Relations for Why-QA | More detailed information concerning the configurations of all the nouns in all the candidates of an appropriate causal relation (including their cause parts) and the question is encoded into our feature sets ef1–ef4 in Table 4, and the final judgment is done by our re-ranker.
Experiments | We evaluated the performance when we removed one of the three types of features (ALL-“MORPH”, ALL-“SYNTACTIC” and ALL-“C-MARKER”) and compared the results in these settings with the one when all the feature sets were used (ALL). |
Experiments | We confirmed that all the feature sets improved the performance, and we got the best performance when using all of them. |
Distant Supervision | However, we did not find a cumulative effect (line 8) of the two feature sets.
Features | We refer to these feature sets as CoreLex (CX) and VerbNet (VN) features and to their combination as semantic features (SEM). |
Features | This feature set is referred to as named entities (NE).
Features | We refer to this feature set as sequential features (SQ). |
Experiment | To compare our joint inference versus other learning models, we also employed a decision tree (DT) learner, equipped with the same feature set as our FCRF. |
Experiment | Both models take the whole feature set described in Section 2.3. |
Experiment | 3.4.3 Feature set evaluation |
Related Work | (2007) used a maximum entropy classifier trained on a feature set that includes the use of gazetteers and a stop-word list, appearance of a NE in the training set, leading and trailing word bigrams, and the tag of the previous word. |
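A minimal sketch of a feature extractor in the spirit of this description, assuming gazetteer, stop-word, training-set, character-bigram, and previous-tag features (all names and signatures below are hypothetical, not the cited paper's implementation):

```python
def ner_features(tokens, i, prev_tag, gazetteer, stopwords, seen_in_train):
    """Extract features for token i, loosely following the description:
    gazetteer and stop-word membership, whether the token appeared as a
    NE in training, leading/trailing character bigrams, and the tag of
    the previous word. Templates are illustrative only."""
    w = tokens[i]
    return {
        "in_gazetteer": w in gazetteer,
        "is_stopword": w.lower() in stopwords,
        "seen_as_ne_in_train": w in seen_in_train,
        "leading_bigram": w[:2],
        "trailing_bigram": w[-2:],
        "prev_tag": prev_tag,
    }
```

Such a dictionary can be fed directly to a maximum entropy (or CRF) toolkit after one-hot encoding.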
Related Work | (2008), they examined the same feature set on the Automatic Content Extraction (ACE) datasets using CRF |
Related Work | Abdul-Hamid and Darwish (2010) used a simplified feature set that relied primarily on character level features, namely leading and trailing letters in a word. |
Features | The principal feature sets are listed in Table 2, together with an indication of whether they are novel or have been used in previous work.
Speaker Identification | Table 2: Principal feature sets.
Speaker Identification | subsequently we add three more feature sets that represent the following neighboring utterances: n − 2, n − 1 and n + 1. Informally, the features of the utterances n − 1 and n + 1 encode the first observation, while the features representing the utterance n − 2 encode the second observation.
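One way to realize the neighboring-utterance feature sets described here is to merge the per-utterance features of n, n − 2, n − 1 and n + 1, prefixing each feature with its offset so the learner can tell which utterance it came from (the prefixing scheme is a sketch, not the paper's exact encoding):

```python
def combine_context_features(utterance_feats, n):
    """Merge the feature dicts of utterance n and its neighbors
    n-2, n-1 and n+1 into one namespaced feature dict.

    `utterance_feats` is a list of per-utterance feature dicts;
    offsets falling outside the dialogue are simply skipped.
    """
    merged = {}
    for offset in (0, -2, -1, +1):
        j = n + offset
        if 0 <= j < len(utterance_feats):
            for name, value in utterance_feats[j].items():
                merged[f"u[{offset:+d}]:{name}"] = value
    return merged
```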
Empirical Evaluation | To compare classification performance, we use two feature sets: (i) standard word + POS 1-4 grams and (ii) AD-expressions from §5.
Empirical Evaluation | Predicting the agreeing arguing nature is harder than predicting the disagreeing one across all feature settings.
Empirical Evaluation | Using the discovered AD-expressions (Table 6, last row) as features renders a statistically significant (see Table 6 caption) improvement over the other baseline feature settings.
CRF and features | The work describes a feature set proposed for this task, which includes word forms in a local window, values of grammatical class, gender, number and case, tests for agreement on number, gender and case, as well as simple tests for letter case. |
CRF and features | We took this feature set as a starting point. |
CRF and features | The final feature set includes the following |