Abstract | The created thesaurus is then used to expand feature vectors to train a binary classifier. |
Introduction | a unigram or a bigram of word lemma) in a review using a feature vector. |
Introduction | We model the cross-domain sentiment classification problem as one of feature expansion, where we append additional related features to feature vectors that represent source and target domain reviews in order to reduce the mismatch of features between the two domains. |
Introduction | thesaurus to expand feature vectors in a binary classifier at train and test times by introducing related lexical elements from the thesaurus. |
Sentiment Sensitive Thesaurus | For example, if we know that both excellent and delicious are positive sentiment words, then we can use this knowledge to expand a feature vector that contains the word delicious using the word excellent, thereby reducing the mismatch between features in a test instance and a trained model. |
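The expansion step described above can be sketched as a small Python routine; the thesaurus contents and the `expand` helper below are illustrative assumptions, not the paper's actual data structures.

```python
# Hypothetical sentiment-sensitive thesaurus: each lexical element maps to
# related elements ranked by relatedness (illustrative entries only).
thesaurus = {"delicious": ["excellent", "tasty"]}

def expand(features, thesaurus, k=1):
    """Append up to k related elements for each feature in the vector,
    so a test instance containing 'delicious' also activates 'excellent'."""
    expanded = list(features)
    for f in features:
        expanded.extend(thesaurus.get(f, [])[:k])
    return expanded

print(expand(["delicious", "food"], thesaurus))  # ['delicious', 'food', 'excellent']
```

In the actual method the appended elements are weighted by a relatedness score; here they are simply concatenated.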
Sentiment Sensitive Thesaurus | Let us denote the value of a feature w in the feature vector u representing a lexical element u by f(u, w). The vector u can be seen as a compact representation of the distribution of a lexical element u over the set of features that co-occur with u in the reviews. |
Sentiment Sensitive Thesaurus | From the construction of the feature vector u described in the previous paragraph, it follows that w can be either a sentiment feature or another lexical element that co-occurs with u in some review sentence. |
Abstract | The first compresses the query into a query feature vector, which aggregates all document instances in the same query, and then conducts query weighting based on the query feature vector. |
Evaluation | Specifically, after document feature aggregation, the number of query feature vectors in all adaptation tasks is no more than 150 in source and target domains. |
Introduction | Take Figure 2 as a toy example, where the document instance is represented as a feature vector with four features. |
Introduction | In this work, we present two simple but very effective approaches attempting to resolve the problem from distinct perspectives: (1) we compress each query into a query feature vector by aggregating all of its document instances, and then conduct query weighting on these query feature vectors; (2) we measure the similarity between the source query and each target query one by one, and then combine these fine-grained similarity values to calculate its importance to the target domain. |
Query Weighting | The query can be compressed into a query feature vector, where each feature value is obtained by the aggregate of its corresponding features of all documents in the query. |
Query Weighting | We concatenate two types of aggregates to construct the query feature vector: the mean x̄ = (1/|q|) Σ_{i=1}^{|q|} x_i, where x_i |
Query Weighting | is the feature vector of document i and |q| denotes the number of documents in q. |
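A minimal sketch of this aggregation, assuming the two concatenated aggregates are the per-feature mean and variance over a query's document vectors (the choice of variance as the second aggregate is an assumption based on the surrounding description):

```python
import numpy as np

def query_feature_vector(doc_vectors):
    """Compress a query's document feature vectors into one query feature
    vector by concatenating per-feature mean and variance."""
    X = np.asarray(doc_vectors, dtype=float)
    return np.concatenate([X.mean(axis=0), X.var(axis=0)])

q = [[1.0, 0.0], [3.0, 2.0]]  # two document feature vectors, |q| = 2
v = query_feature_vector(q)   # mean = [2, 1], variance = [1, 1]
```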
Abstract | In this paper, we use tensors to map high-dimensional feature vectors into low dimensional representations. |
Introduction | The exploding dimensionality of rich feature vectors must then be balanced with the difficulty of effectively learning the associated parameters from limited training data. |
Introduction | We depart from this view and leverage high-dimensional feature vectors by mapping them into low dimensional representations. |
Introduction | We begin by representing high-dimensional feature vectors as multi-way cross-products of smaller feature vectors that represent words and their syntactic relations (arcs). |
Problem Formulation | We can alternatively specify arc features in terms of rank-1 tensors by taking the Kronecker product of simpler feature vectors associated with the head (vector φ_h ∈ R^n), and modifier (vector φ_m ∈ R^n), as well as the arc itself (vector φ_{h,m} ∈ R^d). |
Problem Formulation | Here φ_{h,m} is much lower dimensional than the MST arc feature vector φ_{h→m} discussed earlier. |
Problem Formulation | By taking the cross-product of all these component feature vectors, we obtain the full feature representation for arc h → m as a rank-1 tensor |
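The rank-1 tensor can be sketched with NumPy's `einsum`; the dimensions and vector contents below are illustrative.

```python
import numpy as np

n, d = 4, 3                          # illustrative dimensions
phi_h = np.arange(n, dtype=float)    # head word vector
phi_m = np.ones(n)                   # modifier word vector
phi_arc = np.array([1.0, 2.0, 3.0])  # arc vector

# Rank-1 tensor: the outer (Kronecker-style) product of the components,
# so entry (i, j, k) equals phi_h[i] * phi_m[j] * phi_arc[k].
T = np.einsum('i,j,k->ijk', phi_h, phi_m, phi_arc)
```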
Distributed K-Means clustering | Given a set of elements represented as feature vectors and a number, k, of desired clusters, the K-Means algorithm consists of the following steps: |
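The standard steps can be sketched as follows; this is a minimal single-machine version, not the paper's distributed implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal K-Means sketch: initialize centroids, then alternate
    assignment and update steps for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 1. assign each feature vector to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2. recompute each centroid as the mean of its members
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels, _ = kmeans(X, k=2)  # the two tight pairs end up in separate clusters
```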
Distributed K-Means clustering | Before describing our parallel implementation of the K-Means algorithm, we first describe the phrases to be clustered and how their feature vectors are constructed. |
Distributed K-Means clustering | Following previous approaches to distributional clustering of words, we represent the contexts of a phrase as a feature vector. |
Detailed Problem Formulation | We use discrete features, namely natural numbers, in our feature vectors, quantized by a binning process. |
Detailed Problem Formulation | The length of the feature vector may vary across parts of speech. |
Detailed Problem Formulation | Let N_c denote the length of the feature vector for part of speech c, and x^c denote the time-series (x_1, … |
Introduction | We associate a feature vector with each frame (detection) of each such track. |
Introduction | This feature vector can encode image features (including the identity of the particular detector that produced that detection) that correlate with object class; region color, shape, and size features that correlate with object properties; and motion features, such as linear and angular object position, velocity, and acceleration, that correlate with event properties. |
Introduction | involves computing the associated feature vector for that HMM over the detections in the tracks chosen to fill its arguments. |
The Sentence Tracker | Let (q^1, …, q^T) denote the sequence of states q^t that leads to an observed track, B(D^t, j^t, q^t, λ) denote the conditional log probability of observing the feature vector associated with the detection selected by j^t among the detections D^t in frame t, given that the HMM is in state q^t, and A(q^{t−1}, q^t, λ) denote the log transition probability of the HMM. |
The Sentence Tracker | We further need to generalize F so that it computes the joint score of a sequence of detections, one for each track, G so that it computes the joint measure of coherence between a sequence of pairs of detections in two adjacent frames, and B so that it computes the joint conditional log probability of observing the feature vectors associated with the sequence of detections selected by jt. |
The Sentence Tracker | We further need to generalize B so that it computes the joint conditional log probability of observing the feature vectors for the detections in the tracks that are assigned to the arguments of the HMM for each word in the sentence and A so that it computes the joint log transition probability for the HMMs for all words in the sentence. |
Inference | For each of the remaining feature vectors in |
Inference | Mixture ID (m_t) For each feature vector in a segment, given the cluster label c and the hidden state index s_t, the derivation of the conditional posterior probability of its mixture ID is straightforward: |
Inference | where M_{c,s} is the set of mixture IDs of feature vectors that belong to state s of HMM c. The m-th entry of β′ is β + Σ_{t ∈ M_{c,s}} δ(m_t, m), where we use |
Model | Given the cluster label, choose a hidden state for each feature vector x_t in the segment. |
Model | Use the chosen Gaussian mixture to generate the observed feature vector |
Model | 2, where the shaded circle denotes the observed feature vectors, and the squares denote the hyperparameters of the priors used in our model. |
Problem Formulation | 1 illustrates how the speech signal of a single word utterance banana is converted to a sequence of feature vectors x_1 to x_n. |
Problem Formulation | Segment (p_{j,k}) We define a segment to be composed of feature vectors between two boundary frames. |
Problem Formulation | Hidden State (s_t) Since we assume the observed data are generated by HMMs, each feature vector x_t has an associated hidden state index. |
A Statistical Inclusion Measure | Amongst these features, those found in v's feature vector are termed included features. |
A Statistical Inclusion Measure | In preliminary data analysis of pairs of feature vectors, which correspond to a known set of valid and invalid expansions, we identified the following desired properties for a distributional inclusion measure. |
A Statistical Inclusion Measure | In our case the feature vector of the expanded word is analogous to the set of all relevant documents while tested features correspond to retrieved documents. |
Background | First, a feature vector is constructed for each word by collecting context words as features. |
Background | where FV_x is the feature vector of a word x and w_x(f) is the weight of the feature f in that word's vector, set to their pointwise mutual information. |
Background | Extending this rationale to the textual entailment setting, Geffet and Dagan (2005) expected that if the meaning of a word u entails that of v then all its prominent context features (under a certain notion of “prominence”) would be included in the feature vector of v as well. |
Conclusions and Future work | This paper advocates the use of directional similarity measures for lexical expansion, and potentially for other tasks, based on distributional inclusion of feature vectors. |
Evaluation and Results | Feature vectors were created by parsing the Reuters RCV1 corpus and taking the words related to each term through a dependency relation as its features (coupled with the relation name and direction, as in (Lin, 1998)). |
Abstract | In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector . |
Introduction | We also integrate word embedding into the model by representing each word as a feature vector (Collobert and Weston, 2008). |
Introduction | h = (h_1(f,e,d), …, h_K(f,e,d))^T is a K-dimensional feature vector defined on the tuple (f,e,d); W = (w_1, w_2, …, w_K)^T is a K-dimensional weight vector of h, i.e., the parameters of the model, and it can be tuned by the toolkit MERT (Och, 2003). |
Introduction | (3) as a function of a feature vector h, i.e. |
Bilingual Lexicon Induction | Then, for each matched pair of word types (i, j) ∈ m, we need to generate the observed feature vectors of the source and target word types, f_S(s_i) ∈ R^{d_S} and f_T(t_j) ∈ R^{d_T}. |
Bilingual Lexicon Induction | The feature vector of each word type is computed from the appropriate monolingual corpus and summarizes the word’s monolingual characteristics; see section 5 for details and figure 2 for an illustration. |
Bilingual Lexicon Induction | Specifically, to generate the feature vectors, we first generate a random concept z_{i,j} ~ N(0, I_d), where I_d is the d × d identity matrix. |
Features | For a concrete example of a word type to feature vector mapping, see figure 2. |
Inference | Since d_S and d_T can be quite large in practice and often greater than |m|, we use Cholesky decomposition to re-represent the feature vectors as |m|-dimensional vectors with the same dot products, which is all that CCA depends on. |
Introduction | In our method, we represent each language as a monolingual lexicon (see figure 2): a list of word types characterized by monolingual feature vectors, such as context counts, orthographic substrings, and so on (section 5). |
Conclusions and Future Work | We use standard feature vectors augmented by shallow syntactic trees enriched with additional conceptual information. |
Conclusions and Future Work | This paper makes several contributions: (i) it shows that effective OM can be carried out with supervised models trained on high quality annotations; (ii) it introduces a novel annotated corpus of YouTube comments, which we make available for the research community; (iii) it defines novel structural models and kernels, which can improve on feature vectors, e.g., up to 30% of relative improvement in type classification, when little data is available, and demonstrates that the structural model scales well to other domains. |
Related work | Most of the previous work on supervised sentiment analysis uses feature vectors to encode documents. |
Representations and models | In the next sections, we define a baseline feature vector model and a novel structural model based on kernel methods. |
Representations and models | We go beyond traditional feature vectors by employing structural models (STRUCT), which encode each comment into a shallow syntactic tree. |
Representations and models | A polynomial kernel of degree 3 is applied to feature vectors (FVEC). |
Distribution Prediction | 3.1 In-domain Feature Vector Construction |
Distribution Prediction | 3.2 Cross-Domain Feature Vector Prediction |
Distribution Prediction | We model distribution prediction as a multivariate regression problem where, given a set {(x_i, y_i)}_{i=1}^{n} consisting of pairs of feature vectors selected from each domain for the pivots in W, we learn a mapping |
Domain Adaptation | First, we lemmatise each word in a source domain labeled review x^(S), and extract both unigrams and bigrams as features to represent x^(S) by a binary-valued feature vector. |
Domain Adaptation | Next, we train a binary classification model, θ, using those feature vectors. |
Domain Adaptation | At test time, we represent a test target review H using a binary-valued feature vector h of unigrams and bigrams of lemmas of the words in H, as we did for source domain labeled train reviews. |
Related Work | The created thesaurus is used to expand feature vectors during train and test stages in a binary classifier. |
Answer Grading System | We use ψ(x_i, x_s) to denote the feature vector associated with a pair of nodes (x_i, x_s), where x_i is a node from the instructor answer A_i, and x_s is a node from the student answer A_s. |
Answer Grading System | For a given answer pair (A_i, A_s), we assemble the eight graph alignment scores into a feature vector |
Answer Grading System | We combine the alignment scores φ_G(A_i, A_s) with the scores φ_B(A_i, A_s) from the lexical semantic similarity measures into a single feature vector φ(A_i, A_s) = [φ_G(A_i, A_s) | φ_B(A_i, A_s)]. |
Results | We report the results of running the systems on three subsets of features φ(A_i, A_s): BOW features φ_B(A_i, A_s) only, alignment features φ_G(A_i, A_s) only, or the full feature vector (labeled “Hybrid”). |
Architecture | If a sentence contains two entities and those entities are an instance of one of our Freebase relations, features are extracted from that sentence and are added to the feature vector for the relation. |
Architecture | In training, the features for identical tuples (relation, entity1, entity2) from different sentences are combined, creating a richer feature vector. |
Architecture | This time, every pair of entities appearing together in a sentence is considered a potential relation instance, and whenever those entities appear together, features are extracted on the sentence and added to a feature vector for that entity pair. |
Implementation | Towards this end, we build a feature vector in the training phase for an ‘unrelated’ relation by randomly selecting entity pairs that do not appear in any Freebase relation and extracting features for them. |
Implementation | Our classifier takes as input an entity pair and a feature vector , and returns a relation name and a confidence score based on the probability of the entity pair belonging to that relation. |
Introduction | For each pair of entities, we aggregate the features from the many different sentences in which that pair appeared into a single feature vector , allowing us to provide our classifier with more information, resulting in more accurate labels. |
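The aggregation over sentences can be sketched as follows; the entity pairs and feature strings are illustrative, not the actual Freebase data.

```python
from collections import Counter, defaultdict

def aggregate(instances):
    """Combine features extracted from many sentences into one feature
    vector (here a Counter of feature counts) per entity pair."""
    combined = defaultdict(Counter)
    for pair, feats in instances:
        combined[pair].update(feats)
    return combined

instances = [  # (entity pair, features extracted from one sentence)
    (("Obama", "Hawaii"), ["X_born_in_Y"]),
    (("Obama", "Hawaii"), ["X_born_in_Y", "X_grew_up_in_Y"]),
]
combined = aggregate(instances)  # one richer feature vector per pair
```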
Error Detection with a Maximum Entropy Model | To formalize this task, we use a feature vector w to represent a word w in question, and a binary variable c to indicate whether this word is correct or not. |
Error Detection with a Maximum Entropy Model | In the feature vector, we look at 2 words before and 2 words after the current word position (w_{−2}, w_{−1}, w, w_1, w_2). |
Error Detection with a Maximum Entropy Model | We collect features {wd, pos, link, dwpp} for each word among these words and combine them into the feature vector w for w. |
Related Work | However, when we create feature vectors for the classifier, the seeds themselves are hidden and only contextual features are used to represent each training instance. |
Related Work | We use an in-house sentence segmenter and NP chunker to identify the base NPs in each sentence and create feature vectors that represent each constituent in the sentence as either an NP or an individual word. |
Related Work | Two training instances would be created, with feature vectors that look like this, where M represents a modifier inside the target NP: |
Bayesian MT Decipherment via Hash Sampling | One possible strategy is to compute similarity scores s(w_{f_i}, w_{e′}) between the current source word feature vector w_{f_i} and the feature vectors w_{e′}, e′ ∈ V_e, of all possible candidates in the target vocabulary. |
Bayesian MT Decipherment via Hash Sampling | This makes the complexity far worse (in practice) since the dimensionality d of the feature vectors is a much higher value. Computing similarity scores alone (naïvely) would incur O(|V_e| · d) time, which is prohibitively huge since we have to do this for every token in the source language corpus. |
Feature-based representation for Source and Target | But unlike documents, here each word w is associated with a feature vector w_1 … w_d (where w_i represents the weight for the feature indexed by i) which is constructed from monolingual corpora. |
Feature-based representation for Source and Target | Unlike the target word feature vectors (which can be precomputed from the monolingual target corpus), the feature vector for every source word fj is dynamically constructed from the target translation sampled in each training iteration. |
Feature-based representation for Source and Target | ), it results in the feature representation becoming more sparse (especially for source feature vectors) which can cause problems in efficiency as well as robustness when computing similarity against other vectors. |
Training Algorithm | (a) Generate a proposal distribution by computing the Hamming distance between the feature vectors for the source word and each target translation candidate. |
Integrated Models | by a k-dimensional feature vector f : X → R^k. |
Integrated Models | In the feature-based integration we simply extend the feature vector for one model, called the base model, with a certain number of features generated by the other model, which we call the guide model in this context. |
Integrated Models | The additional features will be referred to as guide features, and the version of the base model trained with the extended feature vector will be called the guided model. |
Methodology | Each SVO (or AN) instance will be represented by a triple (duple) from which a feature vector will be extracted. The vector will consist of the concatenation of the conceptual features (which we discuss below) for all participating words, and conjunction features for word pairs. For example, to generate the feature vector for the SVO triple (car, drink, gasoline), we compute all the features for the individual words car, drink, gasoline and combine them with the conjunction features for the pairs car drink and drink gasoline. |
Methodology | If word one is represented by features u ∈ R^n and word two by features v ∈ R^m then the conjunction feature vector is the vectorization of the outer product uv^T. |
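In NumPy this is a one-liner; the vectors below are illustrative.

```python
import numpy as np

def conjunction_features(u, v):
    """Conjunction feature vector: vectorization of the outer product u v^T,
    giving one feature per (u_i, v_j) pair, of length n * m."""
    return np.outer(u, v).ravel()

u = np.array([1.0, 2.0])         # features of word one (n = 2)
v = np.array([3.0, 4.0, 5.0])    # features of word two (m = 3)
cf = conjunction_features(u, v)  # [3, 4, 5, 6, 8, 10]
```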
Model and Feature Extraction | Degrees of membership in different supersenses are represented by feature vectors, where each element corresponds to one supersense. |
Model and Feature Extraction | For example, the top-level classes in GermaNet include: adj.feeling (e.g., willing, pleasant, cheerful); adj.substance (e.g., dry, ripe, creamy); adj.spatial (e.g., adjacent, gigantic). For each adjective type in WordNet, they produce a vector with classifier posterior probabilities corresponding to degrees of membership of this word in one of the 13 semantic classes, similar to the feature vectors we build for nouns and verbs. |
Model and Feature Extraction | For languages other than English, feature vectors are projected to English features using translation dictionaries. |
Related Work | Instead, we create new feature vectors Fgen on the basis of the feature vectors Fseed in S. For each class in S, we extract all attribute-value pairs from the feature vectors for this particular class. |
Related Work | For each class, we randomly select features (with replacement) from Fseed and combine them into a new feature vector Fgen, retaining the distribution of the different classes in the data. |
Related Work | As a result, we obtain a more general set of feature vectors Fgen with characteristic features being distributed more evenly over the different feature vectors. |
Experiments & Results | For the classifier-based system, we tested various different feature vector configurations. |
Experiments & Results | The various erY configurations use the same feature vector setup for all classifier experts. |
Experiments & Results | The auto configuration does not uniformly apply the same feature vector setup to all classifier experts but instead seeks to find the optimal setup per classifier expert. |
System | The feature vector for the classifiers represents a local context of neighbouring words, and optionally also global context keywords in a binary-valued bag-of-words configuration. |
System | If not, we check for the presence of a classifier expert for the offered L1 fragment; only then can we proceed by extracting the desired number of L2 local context words to the immediate left and right of this fragment and adding those to the feature vector. |
Learning with Homogenous Data | High-dimensional feature vectors with only a few nonzero dimensions make our model time-consuming. |
Learning with Homogenous Data | Thus it is necessary to reduce the dimension of the feature vectors. |
The Deep Belief Network for QA pairs | In the bottom layer, the binary feature vectors based on the statistics of the word occurrence in the answers are used to compute the “hidden features” in the |
The Deep Belief Network for QA pairs | where σ(x) = 1/(1 + e^{−x}), v denotes the visible feature vector of the answer, q_i is the i-th element of the question vector, and h stands for the hidden feature vector for reconstructing the questions. |
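A sketch of the hidden-feature computation with this sigmoid; the shapes and the exact conditioning are simplified assumptions (the paper's network also feeds in the question vector q).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_features(v, W, b):
    """Hidden-layer activations from a visible binary feature vector v."""
    return sigmoid(W @ v + b)

v = np.array([1.0, 0.0, 1.0])  # visible (answer) feature vector
W = np.zeros((2, 3))           # weight matrix (illustrative)
b = np.zeros(2)                # hidden biases
h = hidden_features(v, W, b)   # all 0.5 with zero weights and biases
```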
The Deep Belief Network for QA pairs | To detect the best answer to a given question, we just have to send the vectors of the question and its candidate answers into the input units of the network and perform a level-by-level calculation to obtain the corresponding feature vectors . |
Transliteration alignment techniques | Withgott and Chen (1993) define a feature vector of phonological descriptors for English sounds. |
Transliteration alignment techniques | We extend the idea by defining a 21-element binary feature vector for each English and Chinese phoneme. |
Transliteration alignment techniques | Each element of the feature vector represents presence or absence of a phonological descriptor that differentiates various kinds of phonemes, e.g. |
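Such a binary phonological vector can be sketched as below; the descriptor names are invented for illustration (the paper's actual 21 descriptors are not listed here).

```python
# Illustrative phonological descriptors (hypothetical names, and fewer
# than the 21 used in the paper).
DESCRIPTORS = ["consonant", "voiced", "nasal", "fricative", "front_vowel"]

def phoneme_vector(properties):
    """Binary feature vector: 1 if the phoneme has the descriptor, else 0."""
    return [1 if d in properties else 0 for d in DESCRIPTORS]

v_m = phoneme_vector({"consonant", "voiced", "nasal"})  # e.g. the phoneme /m/
```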
Background | Quite a few methods have been suggested (Lin and Pantel, 2001; Bhagat et al., 2007; Yates and Etzioni, 2009), which differ in terms of the specifics of the ways in which predicates are represented, the features that are extracted, and the function used to compute feature vector similarity. |
Experimental Evaluation | When computing distributional similarity scores, a template is represented as a feature vector of the CUIs that instantiate its arguments. |
Learning Entailment Graph Edges | Next, we represent each pair of propositional templates with a feature vector of various distributional similarity scores. |
Learning Entailment Graph Edges | A template pair is represented by a feature vector where each coordinate is a different distributional similarity score. |
Learning Entailment Graph Edges | Another variant occurs when using binary templates: a template may be represented by a pair of feature vectors, one for each variable (Lin and Pantel, 2001), or by a single vector, where features represent pairs of instantiations (Szpektor et al., 2004; Yates and Etzioni, 2009). |
Relationship Classification | To do this, we construct feature vectors from each training pair, where each feature is the HITS measure corresponding to a single pattern cluster. |
Relationship Classification | Once we have feature vectors, we can use a variety of classifiers (we used those in Weka) to construct a model and to evaluate it on the test set. |
Relationship Classification | If we are not given any training set, it is still possible to separate between different relationship types by grouping the feature vectors of Section 4.3.2 into clusters. |
Approach | where θ ∈ R^d is the parameter vector and φ(x, w, z) is the feature vector, which will be defined in Section 3.3. |
Approach | To construct the log-linear model, we define a feature vector φ(x, w, z) for each query x, web page w, and extraction predicate z. |
Approach | The final feature vector is the concatenation of structural features φ_s(w, z), which consider the selected nodes in the DOM tree, and denotation features φ_d(x, y), which look at the extracted entities. |
Introduction | It is impractical to enumerate all the mentions in an entity and record their information in a single feature vector , as it would make the feature space too large. |
Introduction | Even worse, the number of mentions in an entity is not fixed, which would result in variant-length feature vectors and make trouble for normal machine learning algorithms. |
Modelling Coreference Resolution | As an entity may contain more than one candidate and the number is not fixed, it is impractical to enumerate all the mentions in an entity and put their properties into a single feature vector . |
Related Work | In the system, a training or testing instance is formed for two mentions in question, with a feature vector describing their properties and relationships. |
Baseline parser | Here Φ(a_i) represents the feature vector for the i-th action a_i in state item α. |
Improved hypotheses comparison | The significant variance in the number of actions N can have an impact on the linear separability of state items, for which the feature vectors are Σ_{i=1}^{N} Φ(a_i). |
Improved hypotheses comparison | A feature vector is extracted for the IDLE action according to the final state context, in the same way as other actions. |
Improved hypotheses comparison | corresponding feature vectors have about the same sizes, and are more linearly separable. |
Learning QA Matching Models | Given a word pair (w_q, w_s), where w_q ∈ V_q and w_s ∈ V_s, feature functions φ_1, …, φ_d map it to a d-dimensional real-valued feature vector. |
Learning QA Matching Models | We consider two aggregate functions for defining the feature vectors of the whole question/answer pair: average and max. |
Learning QA Matching Models | Φ_max,j(q, s) = max_{w_q ∈ V_q, w_s ∈ V_s} φ_j(w_q, w_s) (2) Together, each question/sentence pair is represented by a 2d-dimensional feature vector. |
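The average and max aggregates can be sketched over a matrix of word-pair feature vectors; the numbers are illustrative.

```python
import numpy as np

def aggregate_pair_features(pair_feats):
    """pair_feats: (num_word_pairs, d) matrix of word-pair feature vectors.
    Returns the 2d-dimensional concatenation of the average and the max."""
    F = np.asarray(pair_feats, dtype=float)
    return np.concatenate([F.mean(axis=0), F.max(axis=0)])

F = [[0.0, 1.0], [2.0, 3.0]]      # two word pairs, d = 2
phi = aggregate_pair_features(F)  # [1.0, 2.0, 2.0, 3.0]
```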
Our Approach | Then, we use a uniform automatic method, which primarily consists of word labeling and feature vector generation, to generate the training data set TD from these collected articles. |
Our Approach | (1) After the word labeling, each instance (word/token) is represented as a feature vector . |
Our Approach | First, we turn the text into a word sequence and compute the feature vector for each word based on the feature definition in Section 3.1. |
Preliminaries | x_i can be represented as a feature vector according to its context. |
Online Question Generation | We next describe the various features we extract for every entity and the supervised models that given this feature vector representation assess the correctness of an instantiation. |
Online Question Generation | The feature vector of each named entity was induced as described in Section 4.2.1. |
Online Question Generation | To generate features for a candidate pair, we take the two feature vectors of the two entities and induce families of pair features by comparing between the two vectors. |
Discussion | This may be due to the smaller number of feature vectors used in the experiments (only 80, as compared to the 412 used in the previous setup). |
Discussion | Another possible reason is the fact that the acoustic and visual modalities are significantly weaker than the linguistic modality, most likely due to the fact that the feature vectors are now speaker-independent, which makes it harder to improve over the linguistic modality alone. |
Experiments and Results | In this approach, the features collected from all the multimodal streams are combined into a single feature vector, thus resulting in one vector for each utterance in the dataset which is used to make a decision about the sentiment orientation of the utterance. |
Multimodal Sentiment Analysis | The features are averaged over all the frames in an utterance, to obtain one feature vector for each utterance. |
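The frame-averaging step can be sketched directly:

```python
import numpy as np

def utterance_vector(frame_features):
    """Average per-frame feature vectors into one vector per utterance."""
    return np.asarray(frame_features, dtype=float).mean(axis=0)

frames = [[1.0, 3.0], [3.0, 5.0]]  # two frames of an utterance
u = utterance_vector(frames)       # [2.0, 4.0]
```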
Evaluation | Before the clustering process takes place, Web snippets are represented as word feature vectors. |
Evaluation | In particular, p is the size of the word feature vectors representing both Web snippets and centroids (p = 2..5), K is the number of clusters to be found (K = 2..10) and S(W_ik, W_jl) is the collocation measure integrated in the InfoSimba similarity measure. |
Introduction | On the other hand, the polythetic approach, whose main idea is to represent Web snippets as word feature vectors, has received less attention, the only relevant work being (Osinski and Weiss, 2005). |
Introduction | feature vectors are hard to define in small collections of short text fragments (Timonen, 2013), (2) existing second-order similarity measures such as the cosine are unadapted to capture the semantic similarity between small texts, (3) Latent Semantic Analysis has evidenced inconclusive results (Osinski and Weiss, 2005) and (4) the labeling process is a surprisingly hard extra task (Carpineto et al., 2009). |
Experiments | Algorithms 2 and 3 were infeasible to run on Europarl data beyond one epoch because feature vectors grew too large to be kept in memory. |
Introduction | The simple but effective idea is to randomly divide training data into evenly sized shards, use stochastic learning on each shard in parallel, while performing ℓ1/ℓ2 regularization for joint feature selection on the shards after each epoch, before starting a new epoch with a reduced feature vector averaged across shards. |
Joint Feature Selection in Distributed Stochastic Learning | Let each translation candidate be represented by a feature vector x ∈ R^D where preference pairs for training are prepared by sorting translations according to smoothed sentence-wise BLEU score (Liang et al., 2006a) against the reference. |
Joint Feature Selection in Distributed Stochastic Learning | Parameter mixing by averaging will help to ease the feature sparsity problem; however, keeping feature vectors on the scale of several million features in memory can be prohibitive. |
Inferring a learning curve from mostly monolingual data | The feature vector φ consists of the following features: |
Inferring a learning curve from mostly monolingual data | We construct the design matrix Φ with one column for each feature vector φ_ct corresponding to each combination of training configuration c and test set t. |
Inferring a learning curve from mostly monolingual data | For a new unseen configuration with feature vector φ_u, we determine the parameters θ_u of the corresponding learning curve as: |
Distributional semantic models | From every image in a dataset, relevant areas are identified and a low-level feature vector (called a “descriptor”) is built to represent each area. |
Distributional semantic models | Now, given a new image, the nearest visual word is identified for each descriptor extracted from it, such that the image can be represented as a BoVW feature vector, by counting the instances of each visual word in the image (note that an occurrence of a low-level descriptor vector in an image, after mapping to the nearest cluster, will increment the count of a single dimension of the higher-level BoVW vector).
Distributional semantic models | We extract descriptor features of two types. First, the standard Scale-Invariant Feature Transform (SIFT) feature vectors (Lowe, 1999; Lowe, 2004), good at characterizing parts of objects.
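The BoVW construction described above (nearest-word assignment plus counting) can be sketched as follows; the 2-D descriptors and the 3-word codebook are toy stand-ins for real SIFT descriptors and a learned visual vocabulary.

```python
import numpy as np

def bovw_vector(descriptors, codebook):
    """Map each low-level descriptor to its nearest visual word (codebook
    row) and count occurrences, yielding a bag-of-visual-words vector."""
    # squared Euclidean distance between every descriptor and every word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                    # hard assignment
    return np.bincount(nearest, minlength=len(codebook))

# toy 2-D "descriptors" and a 3-word codebook (illustrative values)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descs = np.array([[0.1, 0.0], [0.9, 1.2], [5.1, 4.8], [1.1, 0.9]])
vec = bovw_vector(descs, codebook)   # counts per visual word
```

Each descriptor increments exactly one dimension of the higher-level vector, as the note in the passage above describes.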
Clustering phrase pairs directly using the K-means algorithm | We thus propose to represent each phrase pair instance (including its bilingual one-word contexts) as feature vectors, i.e., points of a vector space.
Clustering phrase pairs directly using the K-means algorithm | We then use these data points to partition the space into clusters, and subsequently assign each phrase pair instance the cluster of its corresponding feature vector as label.
Clustering phrase pairs directly using the K-means algorithm | In the same fashion, we can incorporate multiple tagging schemes (e.g., word clusterings of different granularities) into the same feature vector.
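A minimal sketch of this pipeline: run K-means over the phrase-pair feature vectors and use each point's cluster id as its label. The 2-D vectors below are toys; a real system would use the bilingual context features described above.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal K-means: partition feature vectors into k clusters and
    return the cluster id of each point (used here as its label)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                 # assign to nearest center
        for j in range(k):                          # recompute centers
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# toy phrase-pair feature vectors forming two obvious groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.1], [3.9, 4.0]])
labels = kmeans_labels(X, k=2)
```

Each phrase pair instance then carries its cluster id as an additional label, exactly as the passage describes.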
Approximate Dynamic Programming | Furthermore, as the size of the feature vector K increases, the space becomes even more difficult to search. |
Reinforcement Learning Formulation | Thus, we represent state/action pairs with a feature vector φ(s, a) ∈ R^K.
Reinforcement Learning Formulation | Learning exactly which words influence decision making is difficult; reinforcement learning algorithms have problems with the large, sparse feature vectors common in natural language processing. |
Reinforcement Learning Formulation | For a given state s = (u, l, c) and action a = (l′, o′), our feature vector φ(s, a) is composed of the following:
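Scoring actions with a sparse feature vector φ(s, a) and a linear weight vector can be sketched with plain dicts; the feature names, state fields, and weights below are purely illustrative, not the paper's feature set.

```python
def dot(phi, weights):
    """Sparse dot product between a feature dict and a weight dict."""
    return sum(v * weights.get(f, 0.0) for f, v in phi.items())

def best_action(state, actions, featurize, weights):
    """Greedy policy: pick the action whose features score highest."""
    return max(actions, key=lambda a: dot(featurize(state, a), weights))

# illustrative phi(s, a): indicator features conjoining words with the action
def featurize(state, action):
    return {f"word={w}&act={action}": 1.0 for w in state["words"]}

weights = {"word=enemy&act=attack": 0.8, "word=enemy&act=wait": -0.2}
state = {"words": ["enemy", "near"]}
choice = best_action(state, ["attack", "wait"], featurize, weights)
```

The dict representation keeps only the nonzero entries, which is what makes the large, sparse NLP feature vectors mentioned above tractable.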
Results and Discussion | Then (Einstein, he), (Hawking, he) and (Novoselov, he) will all be assigned the feature vector <1, No, Proper Noun, Personal Pronoun, Yes>.
Results and Discussion | Using the same representation of pairs, suppose that for the sequence of markables Biden, Obama, President the markable pairs (Biden, President) and (Obama, President) are assigned the feature vectors <8, No, Proper Noun, Proper Noun, Yes> and <1, No, Proper Noun, Proper Noun, Yes>, respectively.
Results and Discussion | with the second feature vector (distance=1) as coreferent than with the first one (distance=8) in the entire automatically labeled training set.
System Architecture | After filtering, we then calculate a feature vector for each generated pair that survived filters (i)—(iv). |
Bilingual Tree Kernels | In order to compute the dot product of the feature vectors in the exponentially high dimensional feature space, we introduce the tree kernel functions as follows: |
Bilingual Tree Kernels | It is infeasible to explicitly compute the kernel function by expressing the sub-trees as feature vectors . |
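The standard workaround is a recursive kernel that sums common-subtree counts over node pairs, evaluating the dot product without ever materializing the exponential feature space. Below is a Collins–Duffy-style sketch on tuple-encoded trees (a generic monolingual illustration, not the paper's bilingual kernel); λ is the usual decay parameter.

```python
def nodes(tree):
    """Yield every non-leaf node of a tuple-encoded tree (label, child, ...)."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from nodes(child)

def production(n):
    """A node's production: its label plus the labels of its children."""
    return (n[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in n[1:])

def delta(n1, n2, lam):
    """Decayed count of common subtrees rooted at the node pair (n1, n2)."""
    if production(n1) != production(n2):
        return 0.0
    if all(not isinstance(c, tuple) for c in n1[1:]):   # preterminal node
        return lam
    val = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            val *= 1.0 + delta(c1, c2, lam)
    return val

def tree_kernel(t1, t2, lam=0.5):
    """Implicit dot product of subtree feature vectors: sum delta over
    all node pairs, never expanding the subtrees as explicit vectors."""
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))

t1 = ("S", ("NP", "she"), ("VP", "runs"))
t2 = ("S", ("NP", "he"), ("VP", "runs"))
k = tree_kernel(t1, t2, lam=1.0)
```

The recursion runs in time quadratic in the number of nodes, while the explicit subtree feature space it implicitly dots over is exponential.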
Introduction | In addition, explicitly utilizing syntactic tree fragments results in exponentially high dimensional feature vectors , which is hard to compute. |
Substructure Spaces for BTKs | The feature vector of the classifier is computed using a composite kernel: |
Abstract | The results of these experiments were not particularly strong, likely owing to the increased sparsity of the feature vectors . |
Abstract | Binning: Next, we wished to explore longer n-grams of words or POS tags and to reduce the sparsity of the feature vectors . |
Abstract | Self-Training: Besides sparse feature vectors , another factor likely to be hurting our classifier was the limited amount of training data. |
Incorporating Structural Syntactic Information | Thus, it is computationally infeasible to directly use the feature vector (MT).
The Recognition Framework | Suppose the training set S consists of labeled vectors {(x_i, y_i)}, where x_i is the feature vector
The Recognition Framework | where α_i is the learned parameter for a feature vector x_i, and b is another parameter which can be derived from the α_i.
BBC News Database | Secondly, the generation of feature vectors is modeled directly, so there is no need for quantization. |
BBC News Database | where N_I is the number of regions in image I, v_r the feature vector for region r in image I, n_s the number of regions in the image of latent variable s, v_i the feature vector for region i in s's image, k the dimension of the image feature vectors and Σ the feature covariance matrix.
BBC News Database | According to equation (3), a Gaussian kernel is fit to every feature vector v_i corresponding to region i in the image of the latent variable s.
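Fitting a Gaussian kernel to every region feature vector yields a mixture density p(v) = (1/n) Σ_i N(v; v_i, Σ). A sketch with toy 2-D region vectors and an identity covariance (both illustrative):

```python
import numpy as np

def gaussian_kernel_density(v, centers, cov):
    """p(v) = (1/n) * sum_i N(v; v_i, cov): one Gaussian kernel fit to
    each region feature vector v_i, averaged into a mixture density."""
    k = len(v)                                         # feature dimension
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** k) * np.linalg.det(cov))
    diffs = centers - v
    # Mahalanobis distance of v to every kernel center
    exps = np.exp(-0.5 * np.einsum('ij,jk,ik->i', diffs, inv, diffs))
    return norm * exps.mean()

# toy example: two 2-D region feature vectors, identity covariance
centers = np.array([[0.0, 0.0], [2.0, 0.0]])
cov = np.eye(2)
p = gaussian_kernel_density(np.array([0.0, 0.0]), centers, cov)
```

Because the generation of feature vectors is modeled directly by this density, no vector quantization step is needed, matching the remark above.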
Introduction | Feature Vector |
Introduction | where L_n is its phrase label (i.e., its Treebank tag), and F_n is a feature vector which indicates the characteristics of node n, which is represented as:
Introduction | where δ(t1, t2) is the same indicator function as in CTK; (n_i, n_j) is a pair of aligned nodes between t1 and t2, where n_i and n_j are correspondingly in the same position of trees t1 and t2; E(t1, t2) is the set of all aligned node pairs; sim(n_i, n_j) is the feature vector similarity between nodes n_i and n_j, computed as the dot product between their feature vectors F_{n_i} and F_{n_j}.
Relational Similarity Experiments | Given a verbal analogy example, we build six feature vectors — one for each of the six word pairs. |
Relational Similarity Experiments | For the evaluation, we created a feature vector for each head-modifier pair, and we performed leave-one-out cross-validation: we left one example out for testing and trained on the remaining 599, repeating this procedure 600 times so that each example is used for testing.
Relational Similarity Experiments | We calculated the similarity between the feature vector of the testing example and each of the training examples’ vectors. |
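The leave-one-out protocol with vector similarity can be sketched as a 1-nearest-neighbor loop under cosine similarity; the four toy vectors and binary labels below are illustrative stand-ins for the 600 head-modifier examples.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def loo_nn_accuracy(X, y):
    """Leave-one-out: classify each feature vector by its most similar
    remaining training vector (cosine) and report accuracy."""
    correct = 0
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        sims = [cosine(X[i], X[j]) for j in others]
        correct += y[others[int(np.argmax(sims))]] == y[i]
    return correct / len(X)

# toy head-modifier feature vectors with two relation labels
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
y = np.array([0, 0, 1, 1])
acc = loo_nn_accuracy(X, y)
```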
Experiments | negative transfer from irrelevant sources by relying on similarity of feature vectors between source and target domains based on labeled and unlabeled data. |
Problem Statement | We introduce a feature extraction function that maps the triple (A, B, S) to its feature vector x.
Problem Statement | labeled data D_l = {(x_i, y_i)} and plenty of unlabeled data D_u = {x_j}, where n_l and n_u are the numbers of labeled and unlabeled samples respectively, x_i is the feature vector, and y_i is the corresponding label (if available).
Model | encode a tweet-level feature vector rather than an aggregate one. |
Model | (3) The feature vector encodes the following standard general features:
Related Work | In the user attribute extraction literature, researchers have considered neighborhood context to boost inference accuracy (Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012), where information about the degree of connectivity to pre-labeled users is included in the feature vectors.
Background | Distributional similarity algorithms differ in their feature representation: Some use a binary representation: each predicate is represented by one feature vector where each feature is a pair of arguments (Szpektor et al., 2004; Yates and Etzioni, 2009). |
Learning Typed Entailment Graphs | 2) Feature representation Each example pair of predicates (p1, p2) is represented by a feature vector, where each feature is a specific distributional
Learning Typed Entailment Graphs | We want to use P_{uv} to derive the posterior P(G|F), where F = ∪_{u≠v} F_{uv} and F_{uv} is the feature vector for a node pair (u, v).
Our Approach | g_1, ..., g_c are the composition functions, P(g_h | v_l, v_r, e) is the probability of employing g_h given the child vectors v_l, v_r and the external feature vector e, and f is the nonlinearity function.
Our Approach | where β is the hyper-parameter, S is the matrix used to determine which composition function we use, v_l and v_r are the left and right child vectors, and e is the external feature vector.
Our Approach | In this work, e is a one-hot binary feature vector which indicates the dependency type.
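One way to realize P(g_h | v_l, v_r, e) is a softmax gate over the concatenation [v_l; v_r; e]. The sketch below uses two toy composition functions and an assumed shape for S (with S = 0 the gate is uniform); both are illustrative, not the paper's learned parameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def compose(vl, vr, e, S, comp_fns):
    """Weight each composition function g_h by P(g_h | vl, vr, e) =
    softmax(S [vl; vr; e]) and return the mixed parent vector."""
    probs = softmax(S @ np.concatenate([vl, vr, e]))
    outs = np.stack([g(vl, vr) for g in comp_fns])
    return probs @ outs, probs

# two toy composition functions and a one-hot dependency-type vector e
comp_fns = [lambda l, r: np.tanh(l + r), lambda l, r: np.tanh(l - r)]
vl, vr = np.array([0.5, -0.2]), np.array([0.1, 0.3])
e = np.array([1.0, 0.0])                 # dependency type, one-hot
S = np.zeros((2, 6))                     # c=2 functions; input dim 2d+|e|=6
parent, probs = compose(vl, vr, e, S, comp_fns)
```

Because e is one-hot over dependency types, a trained S lets the model route different dependency relations to different composition functions.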
Empirical Evaluation | Each row is a category and each column represents a feature vector . |
Empirical Evaluation | While the actual size of vocabulary is huge, we use only a small subset of words in our feature vector for this visualization. |
Intervention Prediction Models | Assuming that p represents the posts of thread t, h represents the latent category assignments, and r represents the intervention decision, a feature vector φ(p, r, h, t) is extracted for each thread and, using the weight vector w, this model defines a decision function similar to the one shown in Equation 1.
Brain Imaging Experiments on Adjective-Noun Comprehension | The regression model examined to what extent the semantic feature vectors (explanatory variables) can account for the variation in neural activity (response variable) across the 12 stimuli.
Brain Imaging Experiments on Adjective-Noun Comprehension | Table 5 also supports our hypothesis that the multiplicative model should outperform the additive model, based on the assumption that adjectives are used to emphasize particular semantic features that will already be represented in the semantic feature vector of the noun.
Brain Imaging Experiments on Adjective-Noun Comprehension | We are currently exploring the infinite latent semantic feature model (ILFM; Griffiths & Ghahramani, 2005), which places a nonparametric Indian Buffet Process prior on the binary feature vector and models neural activation with a linear Gaussian model.
Estimating the Tensor Model | We assume a function ψ that maps outside trees o to feature vectors ψ(o) ∈ R^{d′}.
Estimating the Tensor Model | For example, the feature vector might track the rule directly above the node in question, the word following the node in question, and so on. |
Estimating the Tensor Model | We also assume a function φ that maps inside trees t to feature vectors φ(t) ∈ R^d.
Cross-Language Text Classification | In standard text classification, a document d is represented under the bag-of-words model as a |V|-dimensional feature vector x ∈ X, where V, the vocabulary, denotes an ordered set of words, x_i ∈ x denotes the normalized frequency of word i in d, and X is an inner product space.
Cross-Language Text Classification | D_s denotes the training set and comprises tuples of the form (x, y), which associate a feature vector x ∈ X with a class label y ∈ Y.
Experiments | A document d is described as normalized feature vector x under a unigram bag-of-words document representation. |
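The normalized unigram bag-of-words vector described above can be sketched directly; the three-word ordered vocabulary is illustrative.

```python
from collections import Counter

def bow_vector(doc, vocab):
    """|V|-dimensional feature vector of normalized word frequencies:
    x_i is the frequency of vocabulary word i in the document."""
    counts = Counter(w for w in doc.split() if w in vocab)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]

vocab = ["good", "bad", "movie"]          # ordered vocabulary V
x = bow_vector("good movie good fun", vocab)
```

Out-of-vocabulary words ("fun") are dropped, and the remaining counts are normalized so the entries sum to one.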
Our Method | 5: for each word w ∈ t do 6: Get the feature vector w: w = repr_w(w, t).
Our Method | 10: end if 11: end for 12: Get the feature vector t:
Our Method | 4: Get the feature vector w: w = repr_w(w, t).
Introduction | A regression learning method is used to infer a function that maps a feature vector (which measures the similarity of a translation to the pseudo references) to a score that indicates the quality of the translation. |
Translation Selection | The regression objective is to infer a function that maps a feature vector (which measures the similarity of a translation from one system to the pseudo references) to a score that indicates the quality of the translation. |
Translation Selection | The input sentence is represented as a feature vector X, which is extracted from the input sentence and the comparisons against the pseudo references.
Experiments | We used CRFs-based Japanese dependency parser (Imamura et al., 2007) and named entity recognizer (Suzuki et al., 2006) for sentiment extraction and constructing feature vectors for readability score, respectively. |
Optimizing Sentence Sequence | where, given two adjacent sentences s_i and s_{i+1}, w·φ(s_i, s_{i+1}), which measures the connectivity of the two sentences, is the inner product of w and φ(s_i, s_{i+1}); w is a parameter vector and φ(s_i, s_{i+1}) is a feature vector of the two sentences.
Optimizing Sentence Sequence | We also define the feature vector Φ(S) of the entire sequence S = (s_0, s_1, ...
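Scoring an ordering as the sum of w·φ(s_i, s_{i+1}) over adjacent pairs, and searching orderings exhaustively, can be sketched as follows. The one-dimensional connectivity feature is an illustration, not the paper's feature set, and exhaustive search is only feasible for short sequences.

```python
import itertools

def score_sequence(seq, w, phi):
    """Score a sentence order as the sum of w . phi(s_i, s_{i+1}) over
    adjacent pairs; Phi(S) is the sum of the pairwise feature vectors."""
    return sum(sum(wk * fk for wk, fk in zip(w, phi(a, b)))
               for a, b in zip(seq, seq[1:]))

def best_order(sentences, w, phi):
    """Exhaustive search over orderings (fine for toy inputs)."""
    return max(itertools.permutations(sentences),
               key=lambda seq: score_sequence(seq, w, phi))

# toy connectivity feature: does a sentence's last word start the next?
def phi(a, b):
    return [1.0 if a.split()[-1] == b.split()[0] else 0.0]

sents = ["cats chase mice", "mice eat cheese", "cheese smells"]
order = best_order(sents, w=[1.0], phi=phi)
```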
Evaluation | We represent feature vectors exactly as described in Section 3.3. |
Methodology | 3.3 Feature Vector Representation |
Methodology | We can achieve these aims by ordering the counts in a feature vector , and using a labelled set of training examples to learn a classifier that optimally weights the counts. |
Method | A fundamental assumption underlying our model is that this bipartite graph contains the entity transition information needed for local coherence computation, rendering feature vectors and a learning phase unnecessary.
The Entity Grid Model | To make this representation accessible to machine learning algorithms, Barzilay and Lapata (2008) compute for each document the probability of each transition and generate feature vectors representing the sentences. |
The Entity Grid Model | (2011) use discourse relations to transform the entity grid representation into a discourse role matrix that is used to generate feature vectors for machine learning algorithms similarly to Barzilay and Lapata (2008). |
Gaussian Process Regression | In our regression task the data consists of n pairs D = {(x_i, y_i)}, where x_i ∈ R^F is an F-dimensional feature vector and y_i ∈ R is the response variable.
Gaussian Process Regression | Each instance is a translation and the feature vector encodes its linguistic features; the response variable is a numerical quality judgement: post-editing time or Likert score.
Gaussian Process Regression | GP regression assumes the presence of a latent function, f : R^F → R, which maps from the input space of feature vectors x to a scalar.
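The GP predictive mean under an RBF kernel has the closed form k(X*, X)(K + σ²I)⁻¹y. A sketch on toy 1-D quality-estimation data (hyperparameters and data are illustrative):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """RBF (squared-exponential) kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise=1e-2):
    """GP regression predictive mean: k(X*, X) (K + noise*I)^-1 y."""
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(X_star, X) @ np.linalg.solve(K, y)

# toy data: 1-D feature, noisy roughly-linear quality response
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.0])
mu = gp_predict(X, y, np.array([[1.5]]))   # predicted quality at x=1.5
```

With small observation noise the posterior mean nearly interpolates the training responses, so the prediction at 1.5 falls between the neighboring targets 1.1 and 1.9.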
Selectional branching | For each parsing state s_ij, a prediction is made by generating a feature vector x_ij ∈ X, feeding it into a classifier C1 that uses a feature map Φ(x, y) and a weight vector w to measure a score for each label y ∈ Y, and choosing the label with the highest score.
Selectional branching | During training, a training instance is generated for each parsing state s_ij by taking a feature vector x_ij and its true label y_ij.
Selectional branching | Then, a subgradient is measured by taking all feature vectors together weighted by Q (line 6). |
Introduction | We tabulate the transitions of entities between different syntactic positions (or their nonoccurrence) in sentences, and convert the frequencies of transitions into a feature vector representation of transition probabilities in the document. |
Introduction | We solve this problem in a supervised machine learning setting, where the input is the feature vector representations of the two versions of the document, and the output is a binary value indicating the document with the original sentence ordering. |
Introduction | Transition length — the maximum length of the transitions used in the feature vector representation of a document. |
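Tabulating entity transitions and normalizing their frequencies into a feature vector can be sketched as follows; the roles S/O/X/- follow the usual entity-grid conventions, and the two-entity grid is a toy example.

```python
from collections import Counter
from itertools import product

def transition_features(grid, roles=("S", "O", "X", "-"), length=2):
    """Convert an entity grid (one role sequence per entity, across
    sentences) into a feature vector of transition probabilities."""
    counts = Counter()
    total = 0
    for column in grid:                   # one entity's role sequence
        for i in range(len(column) - length + 1):
            counts[tuple(column[i:i + length])] += 1
            total += 1
    # one dimension per possible transition of the given length
    return [counts[t] / total if total else 0.0
            for t in product(roles, repeat=length)]

# toy grid: two entities over three sentences
grid = [["S", "O", "-"],                  # entity 1: subj, obj, absent
        ["-", "S", "S"]]
vec = transition_features(grid)
```

The `length` parameter corresponds to the maximum transition length discussed above; longer transitions enlarge the vector as |roles|^length.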