Distribution Prediction | To reduce the dimensionality of the feature space and create dense representations for words, we perform SVD on F. We use the left singular vectors corresponding to the k largest singular values to compute a rank-k approximation of F. We perform truncated SVD using SVDLIBC. |
Distribution Prediction | Each row in F is considered as representing a word in a lower k-dimensional (k ≪ n_c) feature space corresponding to a particular domain. |
Distribution Prediction | Distribution prediction in this lower-dimensional feature space is preferable to prediction over the original feature space because there are reductions in overfitting, feature sparseness, and learning time.
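As a rough sketch of this step (using SciPy's truncated SVD rather than SVDLIBC; the matrix sizes and the value of k below are illustrative, not taken from the paper):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    def dense_word_representations(F, k=50):
        """Rank-k approximation of a sparse word-by-feature matrix F.

        Each row of the returned matrix is a dense k-dimensional
        representation of one word, built from the top-k left
        singular vectors of F scaled by the singular values.
        """
        # svds returns the k largest singular triplets of a sparse matrix.
        U, s, Vt = svds(csr_matrix(F, dtype=float), k=k)
        return U * s

    # Illustrative usage on a tiny random "co-occurrence" matrix.
    F = np.random.poisson(0.3, size=(1000, 5000))
    W = dense_word_representations(F, k=50)
    print(W.shape)  # (1000, 50)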
Domain Adaptation | increased the training time due to the larger feature space.
Experiments and Results | Therefore, when the overlap between the vocabularies used in the source and the target domains is small, fired cannot reduce the mismatch between the feature spaces.
Experiments and Results | All methods are evaluated under the same settings, including train/test split, feature spaces, pivots, and classification algorithms, so that any differences in performance can be directly attributed to their domain adaptability.
Introduction | First, we create latent feature spaces separately for the source and the target domains using Singular Value Decomposition (SVD).
Introduction | Second, we learn a mapping from the source domain latent feature space to the target domain latent feature space using Partial Least Square Regression (PLSR). |
Introduction | The SVD smoothing in the first step reduces both the data sparseness in distributional representations of individual words and the dimensionality of the feature space, thereby enabling us to efficiently and accurately learn a prediction model using PLSR in the second step.
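A minimal sketch of the second step, assuming aligned k-dimensional latent representations are already available for words observed in both domains (the arrays, sizes, and number of PLS components below are synthetic placeholders):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # Latent representations of the same shared words in the two domains
    # (rows aligned by word); these arrays stand in for the SVD outputs.
    rng = np.random.default_rng(0)
    X_src = rng.normal(size=(500, 50))   # 500 shared words, k = 50
    X_tgt = 0.8 * X_src + rng.normal(scale=0.1, size=(500, 50))

    # Learn a mapping from the source latent space to the target latent space.
    pls = PLSRegression(n_components=20)
    pls.fit(X_src, X_tgt)

    # Predict the target-domain representation of an unseen source-domain word.
    x_new = rng.normal(size=(1, 50))
    x_pred = pls.predict(x_new)
    print(x_pred.shape)  # (1, 50)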
O \ | Because the dimensionality of the source and target domain feature spaces is equal to k, the complexity of the least squares regression problem increases with k. Therefore, larger k values result in overfitting to the training data, and classification accuracy on the target test data is reduced.
Abstract | In effect, our approach finds an optimal feature space (derived from a base feature set and indicator set) for discriminating coreferential mention pairs. |
Abstract | Although our approach explores a very large space of possible feature spaces , it remains tractable by exploiting the structure of the hierarchies built from the indicators. |
Introduction | It is worth noting that, from a machine learning point of view, this is related to feature extraction in that both approaches in effect recast the pairwise classification problem in higher dimensional feature spaces . |
Introduction | We will see that this is also equivalent to selecting a single large adequate feature space by using the data. |
Modeling pairs | Given a document, the number of mentions is fixed and each pair of mentions follows a certain distribution (that we partly observe in a feature space ). |
Modeling pairs | 2.2 Feature spaces 2.2.1 Definitions |
Modeling pairs | that casts pairs into a feature space F through which we observe them. |
Abstract | In this paper, we present a new learning scenario, heterogeneous transfer learning, which improves learning performance when the data can be in different feature spaces and where no correspondence between data instances in these spaces is provided. |
Image Clustering with Annotated Auxiliary Data | Let $\mathcal{F} = \{f_i\}$ be an image feature space, and $\mathcal{V} = \{v_i\}_{i=1}^{|\mathcal{V}|}$ be the image data set.
Introduction | Traditional machine learning relies on the availability of a large amount of data to train a model, which is then applied to test data in the same feature space . |
Introduction | A commonality among these methods is that they all require the training data and test data to be in the same feature space . |
Introduction | However, in practice, we often face the problem where the labeled data are scarce in their own feature space, whereas there may be a large amount of labeled heterogeneous data in another feature space . |
Related Works | This example is related to our image clustering problem because they both rely on data from different feature spaces . |
Related Works | Most learning algorithms for dealing with cross-language heterogeneous data require a translator to convert the data to the same feature space . |
Related Works | For those data that are in different feature spaces where no translator is available, Davis and Domingos (2008) proposed a Markov-logic-based transfer learning algorithm, which is called deep transfer, for transferring knowledge between biological domains and Web domains. |
Bilingual Tree Kernels | In order to compute the dot product of the feature vectors in the exponentially high dimensional feature space , we introduce the tree kernel functions as follows: |
Bilingual Tree Kernels | As a result, we propose the dependent Bilingual Tree kernel (dBTK) to jointly evaluate the similarity across subtree pairs by enlarging the feature space to the Cartesian product of the two substructure sets. |
Bilingual Tree Kernels | Here we verify the correctness of the kernel by directly constructing the feature space for the inner product. |
Introduction | Both kernels can be utilized within different feature spaces using various representations of the substructures. |
Substructure Spaces for BTKs | Given feature spaces defined in the last two sections, we propose a 2-phase subtree alignment model as follows: |
Substructure Spaces for BTKs | Feature Space P R F |
Abstract | In this work, we develop and evaluate a wide range of feature spaces for deriving Levin-style verb classifications (Levin, 1993).
Abstract | We perform the classification experiments using Bayesian Multinomial Regression (an efficient log-linear modeling framework which we found to outperform SVMs for this task) with the proposed feature spaces.
Experiment Setup 4.1 Corpus | Since one of our primary goals is to identify a general feature space that is not specific to any class distinctions, it is of great importance to understand how the classification accuracy is affected when attempting to classify more verbs into a larger number of classes. |
Integration of Syntactic and Lexical Information | ACO features integrate at least some degree of syntactic information into the feature space . |
Related Work | They define a general feature space that is supposed to be applicable to all Levin classes. |
Related Work | (2007) demonstrates that the general feature space they devise achieves a rate of error reduction ranging from 48% to 88% over a chance baseline accuracy, across classification tasks of varying difficulty. |
Related Work | However, they also show that their general feature space does not generally improve the classification accuracy over subcategorization frames (see table 1). |
Results and Discussion | Another feature set that combines syntactic and lexical information, ACO, which keeps function words in the feature space to preserve syntactic information, outperforms the conventional CO on the majority of tasks. |
Conclusion | In addition, we showed that our system is a flexible and modular framework that is able to learn from data of different quality (perfect vs. noisy markable detection) and domain, and is able to deliver good results for shallow information spaces and competitive results for rich feature spaces.
Introduction | Typical systems use a rich feature space based on lexical, syntactic and semantic knowledge. |
Introduction | We view association information as an example of a shallow feature space which contrasts with the rich feature space that is generally used in CoRe. |
Introduction | The feature spaces are the shallow and rich feature spaces . |
Related Work | These researchers show that a “deterministic” system (essentially a rule-based system) that uses a rich feature space including lexical, syntactic and semantic features can improve CoRe performance. |
Results and Discussion | To summarize, the advantages of our self-training approach are: (i) We cover cases that do not occur in the unlabeled corpus (better recall effect); and (ii) we use the leveraging effect of a rich feature space including distance, person, number, gender etc. |
Background | We build a contextual feature space , described in section 4.2, to enhance our baseline bag-of-words model. |
Background | 4.2 Contextual Feature Space Additions |
Background | • Baseline: This model uses a bag-of-words feature space as input to an SVM classifier.
Conclusions and Future Work | We have introduced RM, a novel online margin-based algorithm designed for optimizing high-dimensional feature spaces, which introduces constraints into a large-margin optimizer that bound the spread of the projection of the data while maximizing the margin.
Conclusions and Future Work | Experimentation in statistical MT yielded significant improvements over several other state-of-the-art optimizers, especially in a high-dimensional feature space (up to 2 BLEU and 4.3 TER on average). |
Introduction | However, as the dimension of the feature space increases, generalization becomes increasingly difficult. |
Introduction | This criterion performs well in practice at finding a linear separator in high-dimensional feature spaces (Tsochantaridis et al., 2004; Crammer et al., 2006). |
Introduction | Chinese-English translation experiments show that our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and high-dimensional (sparse) feature space (§4). |
Learning in SMT | Online large-margin algorithms, such as MIRA, have also gained prominence in SMT, thanks to their ability to learn models in high-dimensional feature spaces (Watanabe et al., 2007; Chiang et al., 2009). |
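For context, the core update of a passive-aggressive, MIRA-style online learner over sparse feature vectors looks roughly as follows; this is a generic textbook sketch, not the RM algorithm discussed above, and the feature names are invented:

    def pa_update(weights, feats_gold, feats_pred, loss, C=1.0):
        """One passive-aggressive update: move the weights just enough so the
        gold hypothesis outscores the predicted one by the given loss."""
        # Difference of (sparse) feature vectors: gold minus predicted.
        delta = dict(feats_gold)
        for f, v in feats_pred.items():
            delta[f] = delta.get(f, 0.0) - v

        sq_norm = sum(v * v for v in delta.values())
        if sq_norm == 0.0:
            return
        margin = sum(weights.get(f, 0.0) * v for f, v in delta.items())
        # Hinge-style violation: loss minus the current score margin.
        tau = min(C, max(0.0, loss - margin) / sq_norm)
        for f, v in delta.items():
            weights[f] = weights.get(f, 0.0) + tau * v

    # Illustrative usage with hypothetical n-gram features.
    w = {}
    pa_update(w, {"ngram=the cat": 1.0}, {"ngram=the dog": 1.0}, loss=1.0)
    print(w)  # {'ngram=the cat': 0.5, 'ngram=the dog': -0.5}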
Discussion and Future Directions | The Quality assessing component itself could be built as a module that can be adjusted to the kind of Social Media in use; the creation of customized Quality feature spaces would make it possible to handle different sources of UGC (forums, collaborative authoring websites such as Wikipedia, blogs etc.). |
Discussion and Future Directions | A great obstacle is the lack of systematically available high quality training examples: a tentative solution could be to make use of clustering algorithms in the feature space ; high and low quality clusters could then be labeled by comparison with examples of virtuous behavior (such as Wikipedia’s Featured Articles). |
Experiments | To demonstrate it, we conducted a set of experiments on the original unfiltered dataset to establish whether the feature space Ψ was powerful enough to capture the quality of answers; our specific objective was to estimate the
Related Work | (2008) which inspired us in the design of the Quality feature space presented in Section 2.1. |
The summarization framework | feature space to capture the following syntactic, behavioral and statistical properties: |
The summarization framework | The features mentioned above determined a space Ψ. An answer a, in such a feature space, assumed the vectorial form:
Abstract | We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset.
Introduction | We study the polarity-bearing topics extracted by the JST model and show that by augmenting the original feature space with polarity-bearing topics, the performance of in-domain supervised classifiers learned from augmented feature representation improves substantially, reaching the state-of-the-art results of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. |
Joint Sentiment-Topic (JST) Model | In this paper, we have studied polarity-bearing topics generated from the JST model and shown that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance on both the movie review data and the multi-domain sentiment dataset.
Joint Sentiment-Topic (JST) Model | First, polarity-bearing topics generated by the JST model were simply added into the original feature space of documents; it is worth investigating attaching different weights to each topic
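A minimal sketch of this kind of augmentation, assuming each document already has a bag-of-words vector and a topic assignment from some sentiment-topic model (the toy matrices and topic ids are placeholders, not JST output):

    import numpy as np

    def augment_with_topics(bow, topic_ids, n_topics):
        """Append one indicator column per polarity-bearing topic to the
        original bag-of-words matrix, so classifiers see both views."""
        topics = np.zeros((bow.shape[0], n_topics))
        for doc, t in enumerate(topic_ids):
            topics[doc, t] = 1.0
        return np.hstack([bow, topics])

    # Three toy documents, a 5-word vocabulary, and 4 hypothetical topics.
    bow = np.array([[1, 0, 2, 0, 0],
                    [0, 1, 0, 0, 3],
                    [2, 0, 0, 1, 0]], dtype=float)
    augmented = augment_with_topics(bow, topic_ids=[2, 0, 3], n_topics=4)
    print(augmented.shape)  # (3, 9)

A natural extension, in the spirit of the weighting question raised above, would be to scale the appended topic columns by per-topic weights instead of plain 0/1 indicators.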
Related Work | proposed a kernel-mapping function which maps both source and target domain data to a high-dimensional feature space so that data points from the same domain are twice as similar as those from different domains.
Prediction Experiments | In terms of the size of vocabulary W for both the SME and SVM learner, we select three values to represent dense, medium or sparse feature spaces: $W_1 = 2^9$, $W_2 = 2^{12}$, and the full vocabulary size of $W_3 = 2^{13.8}$.
Prediction Experiments | For example, with a medium density feature space of $2^{12}$, SVM obtained an accuracy of 35.8%, but SME achieved an accuracy of 40.9%, which is a 14.2% relative improvement (p < 0.001) over SVM.
Prediction Experiments | When the feature space becomes sparser, the SME obtains an increased relative improvement (p < 0.001) of 16.1%, using the full vocabulary.
Conclusion | The size of the employed lexicon determines the dimension of the feature space . |
Discussion | Combining two head nouns may increase the feature space |
Discussion | Such a large feature space makes the occurrence of features close to a random distribution, leading to worse data sparseness.
Feature Construction | Because the number of lexicon entries determines the dimension of the feature space, the performance of the Omni-word feature is influenced by the lexicon being employed.
Related Work | (2010) proposed a model handling the high dimensional feature space . |
Experiments | Experiments evaluate the FWD and SemTree feature spaces compared to two baselines: bag-of-words (BOW) and supervised latent Dirichlet allocation (sLDA) (Blei and McAuliffe, 2007). |
Experiments | SVM-light with tree kernels (Joachims, 2006; Moschitti, 2006) is used for both the FWD and SemTree feature spaces.
Methods | 4.2 SemTree Feature Space and Kernels |
Methods | We propose SemTree as another feature space to encode semantic information in trees. |
Related Work | We explore a rich feature space that relies on frame semantic parsing. |
Cue Discovery for Content Selection | Our feature space $X = \{x_1, x_2, \ldots\}$
Cue Discovery for Content Selection | We search only top candidates for efficiency, following the fixed-width search methodology for feature selection in very high-dimensionality feature spaces (Gutlein et al., 2009). |
Experimental Results | We use a binary unigram feature space, and we perform 7-fold cross-validation.
Prediction | One challenge of this approach is our underlying unigram feature space - tree-based algorithms are generally poor classifiers for the high-dimensionality, low-information features in a lexical feature space (Han et al., 2001). |
Prediction | We exhaustively sweep this feature space , and report the most successful stump rules for each annotation task. |
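A sketch of such a sweep over a binary unigram feature space, where every candidate stump predicts the positive class whenever one unigram is present and stumps are ranked by training accuracy (the toy data and scoring are illustrative):

    import numpy as np

    def best_stumps(X, y, top_n=5):
        """Exhaustively score one-feature stump rules on a binary
        unigram matrix X (documents x unigrams) with 0/1 labels y."""
        scores = []
        for j in range(X.shape[1]):
            pred = X[:, j] > 0           # "positive iff unigram j is present"
            acc = float(np.mean(pred == (y == 1)))
            scores.append((acc, j))
        return sorted(scores, reverse=True)[:top_n]

    X = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])
    y = np.array([1, 0, 1, 0])
    print(best_stumps(X, y))  # feature 0 separates the toy data perfectly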
Copula Models for Text Regression | By doing this, we are essentially performing the probability integral transform, an important statistical technique that moves beyond the count-based bag-of-words feature space to the space of marginal cumulative distribution functions.
Discussions | By applying the Probability Integral Transform to raw features in the copula model, we essentially avoid comparing apples and oranges in the feature space , which is a common problem in bag-of-features models in NLP. |
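A rough sketch of the probability integral transform on raw features, using a rank-based empirical CDF per column (an illustrative estimator, not necessarily the one used by the authors):

    import numpy as np

    def empirical_pit(X):
        """Map each raw feature column to its empirical CDF values in (0, 1),
        so heterogeneous features become comparable marginals."""
        n, d = X.shape
        U = np.empty_like(X, dtype=float)
        for j in range(d):
            # Rank-based empirical CDF; dividing by n + 1 keeps values inside (0, 1).
            ranks = np.argsort(np.argsort(X[:, j])) + 1
            U[:, j] = ranks / (n + 1.0)
        return U

    X = np.array([[3, 100], [0, 5], [7, 42]], dtype=float)
    print(empirical_pit(X))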
Experiments | model over the squared-loss linear regression model are increasing when working with larger feature spaces.
Related Work | For example, when bag-of-words unigrams are present in the feature space, it is easier if one does not explicitly model the stochastic dependencies among the words, even though ignoring them might hurt the predictive power, since the variance from the correlations among the random variables is left unexplained.
Abstract | The scores of tag predictions are usually computed in a high-dimensional feature space.
Abstract | Consider a character-based feature function $\phi(c, t, \mathbf{c})$ that maps a character-tag pair to a high-dimensional feature space, with respect to an input character sequence $\mathbf{c}$. For a possible word $w$ over $\mathbf{c}$ of length $l$, $w = c_i \ldots c_{i+l-1}$.
Abstract | In Section 3.4, we describe a way of mapping words to a character-based feature space . |
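As a toy illustration of a character-based feature function, the sketch below emits sparse indicator features for a character-tag pair in its surrounding sequence; the feature templates are invented for illustration and are not the paper's feature set:

    def char_features(chars, i, tag):
        """Indicator features for assigning `tag` to character position i
        of the input character sequence `chars`."""
        feats = {}
        feats[f"tag={tag}&char={chars[i]}"] = 1.0
        if i > 0:
            feats[f"tag={tag}&prev_char={chars[i-1]}"] = 1.0
        if i + 1 < len(chars):
            feats[f"tag={tag}&next_char={chars[i+1]}"] = 1.0
        return feats

    print(char_features(list("北京大学"), 1, "I"))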
Introduction | In addition, more advanced regularisation functions enable multitask learning schemes that can exploit shared structure in the feature space . |
Methods | Group LASSO exploits a predefined group structure on the feature space and tries to achieve sparsity at the group level, i.e.
Methods | In this optimisation process, we aim to enforce sparsity in the feature space but in a structured manner. |
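A small sketch of the group-level shrinkage such a penalty induces, i.e. the proximal operator of the group LASSO term applied to a weight vector with a predefined group structure (names and values are illustrative):

    import numpy as np

    def group_soft_threshold(w, groups, reg):
        """Proximal step for the group LASSO penalty: shrink each feature
        group toward zero and zero out whole groups whose norm is below reg."""
        w = w.copy()
        for idx in groups:
            norm = np.linalg.norm(w[idx])
            if norm <= reg:
                w[idx] = 0.0          # the whole group is dropped
            else:
                w[idx] *= 1.0 - reg / norm
        return w

    w = np.array([0.1, -0.2, 2.0, 1.5, 0.05])
    groups = [[0, 1], [2, 3], [4]]
    print(group_soft_threshold(w, groups, reg=0.5))  # [0. 0. 1.6 1.2 0.]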
Conclusion and Future Work | Another direction involves incorporating a richer feature space for better inference performance, such as multimedia sources (i.e.
Experiments | We evaluate the settings described in Section 4.2, i.e., the GLOBAL setting, where the user-level attribute is predicted directly from the joint feature space, and the LOCAL setting, where the user-level prediction is made based on tweet-level predictions, along with the different inference approaches described in Section 4.4, i.e.
Experiments | This can be explained by the fact that LOCAL(U) assigns a positive user-level label as soon as one posting is identified as attribute related, while GLOBAL tends to be more meticulous by considering the conjunctive feature space from all postings.
Wikipedia-based Composite Kernel for Dialog Topic Tracking | Since our hypothesis is that the more similar the dialog histories of the two inputs are, the more similar aspects of topic transitions occur for them, we propose a subsequence kernel (Lodhi et al., 2002) to map the data into a new feature space defined based on the similarity of each pair of history sequences as follows:
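As a brute-force illustration of gap-weighted subsequence matching (an explicit feature map over length-n subsequences weighted by a decay factor over their span, rather than the efficient recursive formulation of Lodhi et al., 2002):

    from itertools import combinations
    from collections import defaultdict

    def subseq_feature_map(seq, n, decay):
        """phi(seq): weight of each length-n (possibly gappy) subsequence,
        penalised by decay ** (span of the matched positions)."""
        phi = defaultdict(float)
        for idx in combinations(range(len(seq)), n):
            span = idx[-1] - idx[0] + 1
            phi[tuple(seq[i] for i in idx)] += decay ** span
        return phi

    def subseq_kernel(s, t, n=2, decay=0.5):
        ps, pt = subseq_feature_map(s, n, decay), subseq_feature_map(t, n, decay)
        return sum(v * pt.get(u, 0.0) for u, v in ps.items())

    # Similarity of two hypothetical dialog-history topic sequences.
    print(subseq_kernel(["FOOD", "SHOPPING", "TRANSPORT"],
                        ["FOOD", "TRANSPORT", "SHOPPING"]))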
Wikipedia-based Composite Kernel for Dialog Topic Tracking | The other kernel incorporates various additional types of domain knowledge obtained from Wikipedia into the feature space.
Wikipedia-based Composite Kernel for Dialog Topic Tracking | Since this constructed tree structure represents semantic, discourse, and structural information extracted from the similar Wikipedia paragraphs to each given instance, we can explore these more enriched features to build the topic tracking model using a subset tree kernel (Collins and Duffy, 2002) which computes the similarity between each pair of trees in the feature space as follows: |
Abstract | Our method is based on recent advances in the field of statistical machine learning (multivariate capabilities of Support Vector Machines) and a rich feature space . |
Building a Discourse Parser | This makes SVM well-fitted to treat classification problems involving relatively large feature spaces such as ours ($\approx 10^5$ features).
Evaluation | The feature space dimension is 136,987. |
Methods 2.1 Document Level and Profile Based CDC | Kernelization (Schölkopf and Smola, 2002) is a machine learning technique to transform patterns in the data space to a high-dimensional feature space so that the structure of the data can be more easily and adequately discovered.
Methods 2.1 Document Level and Profile Based CDC | Using the kernel trick, the squared distance between $\phi(r_j)$ and $\phi(w_i)$ in the feature space H can be computed as:
Methods 2.1 Document Level and Profile Based CDC | We measure the kernelized XBI (KXBI) in the feature space as,
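The kernel-trick identity behind such distance computations can be sketched as follows, with an RBF kernel standing in for whichever kernel is actually used:

    import numpy as np

    def rbf_kernel(x, y, gamma=0.5):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def squared_feature_space_distance(x, y, kernel):
        """||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y),
        computed without ever constructing phi explicitly."""
        return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

    x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    print(squared_feature_space_distance(x, y, rbf_kernel))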
Cross-Language Structural Correspondence Learning | MASK(x, p_l) is a function that returns a copy of x where the components associated with the two words in p_l are set to zero, which is equivalent to removing these words from the feature space.
Cross-Language Structural Correspondence Learning | Since $(\theta^\top v)^\top = v^\top \theta$, it follows that this view of CL-SCL corresponds to the induction of a new feature space given by Equation 2.
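A literal reading of the MASK operation on a sparse feature vector, assuming the pivot is given as its pair of source- and target-language word features (the example words are illustrative):

    def mask(x, pivot_pair):
        """Return a copy of the sparse feature vector x with the components
        for the two pivot words set to zero, i.e. removed from the feature space."""
        masked = dict(x)
        for word_feature in pivot_pair:
            if word_feature in masked:
                masked[word_feature] = 0.0
        return masked

    x = {"excellent": 2.0, "boring": 1.0, "ausgezeichnet": 1.0}
    print(mask(x, ("excellent", "ausgezeichnet")))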
Cross-Language Text Classification | I.e., documents from the training set and the test set map onto two non-overlapping regions of the feature space.
Introduction | In NLP applications, the dimension of the feature space tends to be very large—it can easily become several millions, so the application of L1 penalty to all features significantly slows down the weight updating process. |
Log-Linear Models | Since the dimension of the feature space can be very large, it can significantly slow down the weight update process. |
Log-Linear Models | Another merit is that it allows us to perform the application of L1 penalty in a lazy fashion, so that we do not need to update the weights of the features that are not used in the current sample, which leads to much faster training when the dimension of the feature space is large. |
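A sketch of this lazy scheme in the spirit of SGD with a cumulative L1 penalty (cf. Tsuruoka et al., 2009): only features active in the current sample are updated, and each receives the penalty accumulated since it was last touched. The names, learning rate, and regularization strength are illustrative:

    def l1_sgd_step(w, q, u, active_grads, eta, lam):
        """One SGD step with a lazily applied (cumulative) L1 penalty.

        w: weight dict; q: signed penalty each weight has absorbed so far;
        u: running total of the maximum penalty any weight could have absorbed;
        active_grads: {feature: loss gradient} for the current sample only.
        Returns the updated u. Cost is independent of the full dimension.
        """
        u += eta * lam
        for f, g in active_grads.items():
            w[f] = w.get(f, 0.0) - eta * g          # plain gradient step
            z, qf = w[f], q.get(f, 0.0)
            # Clip toward zero by the penalty this weight has not yet absorbed.
            if z > 0:
                w[f] = max(0.0, z - (u + qf))
            elif z < 0:
                w[f] = min(0.0, z + (u - qf))
            q[f] = qf + (w[f] - z)
        return u

    w, q, u = {}, {}, 0.0
    u = l1_sgd_step(w, q, u, {"word=good": -1.0}, eta=0.1, lam=0.05)
    print(w)   # {'word=good': 0.095}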
Abstract | We present a novel hierarchical prior structure for supervised transfer learning in named entity recognition, motivated by the common structure of feature spaces for this task across natural language data sets. |
Introduction | In particular, we develop a novel prior for named entity recognition that exploits the hierarchical feature space often found in natural language domains (§1.2) and allows for the transfer of information from labeled datasets in other domains (§1.3).
Introduction | Representing feature spaces with this kind of tree, besides often coinciding with the explicit language used by common natural language toolkits (Cohen, 2004), has the added benefit of allowing a model to easily back-off, or smooth, to decreasing levels of specificity. |
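One way to picture such a hierarchical feature space is to treat each feature name as a path in a tree and emit all ancestor prefixes as back-off features; the delimiter and feature names below are illustrative, not a specific toolkit's conventions:

    def backoff_features(feature_name, sep="."):
        """All prefixes of a hierarchical feature name, from most specific
        to most general, so a model can smooth toward coarser nodes."""
        parts = feature_name.split(sep)
        return [sep.join(parts[:i]) for i in range(len(parts), 0, -1)]

    print(backoff_features("token.lower.word=smith"))
    # ['token.lower.word=smith', 'token.lower', 'token']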
Abstract | The resulting argument classification model promotes a simpler feature space that limits the potential overfitting effects. |
Introduction | The model adopts a simple feature space by relying on a limited set of grammatical properties, thus reducing its learning capacity. |
Introduction | As we will see, the accuracy reachable through a restricted feature space is still quite close to the state-of-art, but interestingly the performance drops in out-of-domain tests are avoided. |
Abstract | Because considering such features would increase the size of the feature space , we suspected that including these features would also benefit from algorithmic means of selecting n-grams that are indicative of particular lects, and even from binning these relevant n-grams into sets to be used as features. |
Abstract | Although this approach to partitioning is simple and worthy of improvement, it effectively reduced the dimensionality of the feature space . |
Abstract | Therefore, as we explored the feature space , small bins of different n-gram lengths were merged. |
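A rough sketch of one way to score lect-indicative character n-grams and collect the top ones into a bin usable as a single set-valued feature; the log-odds scoring, smoothing, and toy data are illustrative, not the authors' procedure:

    from collections import Counter
    from math import log

    def indicative_ngrams(docs, labels, target_lect, n=3, top_k=50):
        """Score character n-grams by smoothed log-odds of appearing in the
        target lect versus the rest, and return the top_k as one bin."""
        in_counts, out_counts = Counter(), Counter()
        for text, lab in zip(docs, labels):
            grams = [text[i:i + n] for i in range(len(text) - n + 1)]
            (in_counts if lab == target_lect else out_counts).update(grams)
        total_in = sum(in_counts.values()) + 1
        total_out = sum(out_counts.values()) + 1
        scores = {
            g: log((in_counts[g] + 1) / total_in) - log((out_counts[g] + 1) / total_out)
            for g in in_counts
        }
        return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

    docs = ["gauchement fait", "doucement dit", "quickly done", "slowly said"]
    labels = ["fr", "fr", "en", "en"]
    bin_fr = indicative_ngrams(docs, labels, "fr")
    # A document's feature value for this bin could be the count of bin members it contains.
    print(sorted(bin_fr)[:5])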