Abstract | In a straightforward bag-of-words experimental setup we add etymological ancestors of the words in the documents, and investigate the performance of a model built on English data when applied to Italian test data (and vice versa). |
Abstract | The results show not only a statistically significant but also a large improvement, a jump of almost 40 points in F1-score, over the raw (vanilla bag-of-words) representation. |
Cross Language Text Categorization | The most frequently, and most successfully, used document representation is the bag-of-words (BoW). |
Cross Language Text Categorization | As is commonly done in text categorization (Sebastiani, 2005), the documents in our data are represented as bag-of-words, and classification is done using support vector machines (SVMs). |
Cross Language Text Categorization | The bag-of-words representation for each document is expanded with the corresponding etymological features. |
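The two lines above describe a concrete pipeline: bag-of-words features, expansion with etymological ancestors, and SVM classification. A minimal sketch of that idea follows; the `etymology` lookup, the toy documents, and the labels are illustrative assumptions, not the paper's actual resources.

```python
# Sketch only: BoW features expanded with (hypothetical) etymological ancestors,
# classified with a linear SVM. The etymology table is a toy stand-in.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

etymology = {"forest": ["forestis"], "foresta": ["forestis"]}  # shared Latin root

def expand(doc):
    # Append each token's etymological ancestors so that cognates across
    # languages end up sharing features in the BoW space.
    tokens = doc.lower().split()
    return " ".join(tokens + [a for t in tokens for a in etymology.get(t, [])])

train_docs = ["the forest near the lake", "stock markets fell sharply"]
train_labels = ["nature", "finance"]
test_docs = ["la foresta vicino al lago"]  # Italian test document

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([expand(d) for d in train_docs])
clf = LinearSVC().fit(X_train, train_labels)
X_test = vectorizer.transform([expand(d) for d in test_docs])
print(clf.predict(X_test))  # the shared root "forestis" bridges the two languages
```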
Discussion | Feature filtering is commonly done in machine learning when the data has many features, and in text categorization when using the bag-of-words representation in particular. |
Discussion | The difference in results on the two dictionary versions was significant: a 4 and 5 point increase, respectively, in micro-averaged F1-score in the bag-of-words setting for English training→Italian testing and Italian training→English testing, and a 2 and 6 point increase in the LSA setting. |
Introduction | We start with the basic setup, representing the documents as bag-of-words, where we train a model on the English training data, and use this model to categorize documents from the Italian test data (and vice versa). |
Introduction | We then add the etymological roots of the words in the data to the bag-of-words representation, and observe a large increase in performance, 21 points in terms of F1-score. |
Introduction | We then use the bag-of-words representation of the training data to build a semantic space using LSA, and use the generated word vectors to represent the training and test data. |
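A hedged sketch of that LSA step: truncated SVD over the bag-of-words counts builds the semantic space, and both training and test documents are projected into it. The corpus and the dimensionality below are toy choices, not the paper's configuration.

```python
# Sketch only: LSA semantic space built from BoW counts via truncated SVD.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

train_docs = ["markets rise on strong earnings", "the forest and the lake",
              "banks cut interest rates"]
test_docs = ["rivers and forests"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)            # BoW count matrix
lsa = TruncatedSVD(n_components=2).fit(X_train)           # latent semantic space
Z_train = lsa.transform(X_train)                          # dense LSA vectors (train)
Z_test = lsa.transform(vectorizer.transform(test_docs))   # project test documents
print(Z_train.shape, Z_test.shape)
```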
Conclusions | Following the word-alignment paradigm, we find that the rich lexical semantic information improves the models consistently in the unstructured bag-of-words setting and also in the framework of learning latent structures. |
Experiments | For the unstructured, bag-of-words setting, we tested logistic regression (LR) and boosted decision trees (BDT). |
Introduction | Due to the variety of word choices and inherent ambiguities in natural languages, bag-of-words approaches with simple surface-form word matching tend to produce brittle results with poor prediction accuracy (Bilotti et al., 2007). |
Learning QA Matching Models | In this section, we investigate the effectiveness of various learning models for matching questions and sentences, including the bag-of-words setting. |
Learning QA Matching Models | 5.1 Bag-of-Words Model |
Learning QA Matching Models | The bag-of-words model treats each question and sentence as an unstructured bag of words. |
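As an illustration of what such a model boils down to, the sketch below scores question/sentence pairs by cosine similarity of their term-count vectors, ignoring word order entirely; the question and candidate sentences are made-up examples, not the paper's data.

```python
# Sketch only: bag-of-words question/sentence matching via cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "who wrote the opera carmen"
candidates = ["carmen is an opera written by georges bizet",
              "the opera house opened in 1875"]

vectorizer = CountVectorizer().fit([question] + candidates)
q_vec = vectorizer.transform([question])
c_vecs = vectorizer.transform(candidates)
scores = cosine_similarity(q_vec, c_vecs)[0]   # one score per candidate sentence
best = scores.argmax()
print(scores, candidates[best])
```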
Problem Definition | For instance, if we assume a naive complete bipartite matching, then it effectively reduces to the simple bag-of-words model. |
Related Work | Observing the limitations of the bag-of-words models, Wang et al. |
Experiments | Experiments evaluate the FWD and SemTree feature spaces compared to two baselines: bag-of-words (BOW) and supervised latent Dirichlet allocation (sLDA) (Blei and McAuliffe, 2007). |
Introduction | Our main contribution is a novel tree representation based on semantic frame parses that performs significantly better than enriched bag-of-words vectors. |
Introduction | On the polarity task, the semantic frame features encoded as trees perform significantly better across years and sectors than bag-of-words vectors (BOW), and also outperform BOW vectors enhanced with semantic frame features as well as a supervised topic modeling approach. |
Methods | Table 1 lists 24 types of features, including semantic Frame attributes, bag-of-Words, and scores for words in the Dictionary of Affect in Language by part of speech (pDAL). |
Methods | Bag-of-Words features include term frequency and tf-idf of unigrams, bigrams, and trigrams. |
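A small sketch of those features, assuming standard scikit-learn vectorizers; the two example documents are placeholders.

```python
# Sketch only: term-frequency and tf-idf features over 1- to 3-grams.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["profits rose sharply this quarter", "profits fell sharply last quarter"]

tf = CountVectorizer(ngram_range=(1, 3)).fit_transform(docs)     # raw term frequency
tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)  # tf-idf weighting
print(tf.shape, tfidf.shape)  # same unigram/bigram/trigram vocabulary, different weights
```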
Motivation | Bag-of-Words (BOW) document representation is difficult to surpass for many document classification tasks, but cannot capture the degree of semantic similarity among these sentences. |
Related Work | NLP has recently been applied to financial text for market analysis, primarily using bag-of-words (BOW) document representation. |
Related Work | Table 1: FWD features (Frame, bag-of-Words, part-of-speech DAL score) and their value types. |
Abstract | Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. |
Background | 2.1 Bag-of-Words (BOW) Representation |
Background | One way of doing this is the bag-of-words (BoW) representation, where each document becomes a vector of its words/tokens. |
Conclusion | Firstly, the bag-of-words representation is replaced with topic vectors, which provide good dimensionality reduction while retaining comparable classification performance. |
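A minimal sketch of that replacement, with a generic LDA topic model standing in for whatever topic model the report pipeline actually uses; the report snippets and the number of topics are illustrative assumptions.

```python
# Sketch only: compact topic-vector document representation replacing raw BoW.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reports = ["engine failure during climb", "smoke detected in the cabin",
           "runway incursion while taxiing", "hydraulic leak after landing"]

X = CountVectorizer().fit_transform(reports)              # sparse BoW counts
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)
topic_vectors = lda.transform(X)                          # one 3-dim topic mixture per report
print(topic_vectors.shape)                                # (4, 3): far smaller than the vocabulary
```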
Abstract | Second, by going beyond a bag-of-words approach, it takes into account the inherent sequential nature of utterances to learn semantic classes based on context. |
Experiments | Similarly, if no Markov properties are used (bag-of-words), MTR reduces to w-LDA. |
Related Work and Motivation | Standard topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), use a bag-of-words approach, which disregards word order and clusters words together that appear in a similar global context. |
Approach | We use the document's binary bag-of-words vector vj, and compute the document's vector space representation through the matrix-vector product Dvj. |
Results and Discussion | Table fragment: per-feature-set results ("+ Bag-of-words features" row) across training set sizes of .5K, 5K, and 20K examples. |
Results and Discussion | Additional features: Across all embeddings, appending the document’s binary bag-of-words representation increases classification accuracy. |
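The two statements above (the matrix-vector product Dvj and appending the binary bag-of-words vector) can be sketched as follows; the matrix D, the vocabulary size, and the embedding dimension are random placeholders rather than the paper's actual embeddings.

```python
# Sketch only: embed a document via D @ v_j, optionally appending the binary BoW vector.
import numpy as np

vocab_size, embed_dim = 1000, 50
D = np.random.randn(embed_dim, vocab_size)    # columns act as word embeddings
v_j = np.zeros(vocab_size)
v_j[[3, 17, 256]] = 1.0                       # binary BoW: which vocabulary items occur

doc_repr = D @ v_j                            # matrix-vector product D v_j
augmented = np.concatenate([doc_repr, v_j])   # append the binary BoW as extra features
print(doc_repr.shape, augmented.shape)        # (50,) and (1050,)
```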
Abstract | Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. |
Bayesian MT Decipherment via Hash Sampling | Firstly, we would like to include as many features as possible to represent the source/target words in our framework besides simple bag-of-words context similarity (for example, left-context, right-context, and other general-purpose features based on topic models, etc.). |
Introduction | Secondly, we introduce a new feature-based representation for sampling translation candidates that allows one to incorporate any amount of additional features (beyond simple bag-of-words) as side-information during decipherment training. |