Abstract | Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data.
Abstract | However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, meaning the techniques are not applicable to the large corpora being produced today.
Abstract | In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets. |
Introduction | Semi-supervised learning (SSL) is a machine learning (ML) approach that utilizes large amounts of unlabeled data, combined with a smaller amount of labeled data, to learn a target function (Zhu, 2006; Chapelle et al., 2006).
Introduction | Typically, for each target concept to be learned, a semi-supervised classifier is trained using iterative techniques that execute multiple passes over the unlabeled data (e.g., Expectation-Maximization (Nigam et al., 2000) or Label Propagation (Zhu and Ghahramani, 2002)).
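The iterative scheme described above can be illustrated with a minimal EM loop for a multinomial Naive Bayes text classifier. This is a dependency-free sketch; the function names and toy data are our own, not from any cited paper. Note how every iteration re-reads the entire unlabeled set, which is exactly the scaling bottleneck these sentences point at:

```python
import math
from collections import defaultdict

def fit_mnb(examples, alpha=1.0):
    """Fit multinomial Naive Bayes from soft-labeled docs: (words, [p0, p1])."""
    vocab = {w for words, _ in examples for w in words}
    prior = [alpha, alpha]
    counts = [defaultdict(float), defaultdict(float)]
    for words, probs in examples:
        for c in (0, 1):
            prior[c] += probs[c]
            for w in words:
                counts[c][w] += probs[c]
    totals = [sum(counts[c].values()) for c in (0, 1)]

    def posterior(words):
        # Log-domain scoring with Laplace smoothing, then normalize.
        scores = []
        for c in (0, 1):
            s = math.log(prior[c])
            for w in words:
                s += math.log((counts[c][w] + alpha) / (totals[c] + alpha * len(vocab)))
            scores.append(s)
        m = max(scores)
        z = sum(math.exp(s - m) for s in scores)
        return [math.exp(s - m) / z for s in scores]

    return posterior

def em_ssl(labeled, unlabeled, iters=5):
    """EM: the E-step scores ALL unlabeled docs, the M-step retrains on soft counts."""
    hard = [(w, [1.0 if c == y else 0.0 for c in (0, 1)]) for w, y in labeled]
    posterior = fit_mnb(hard)
    for _ in range(iters):            # each iteration is a full pass over unlabeled
        soft = [(w, posterior(w)) for w in unlabeled]
        posterior = fit_mnb(hard + soft)
    return posterior
```

Swapping `fit_mnb` for any probabilistic learner gives the general pattern; the cost is `iters` full scans of the unlabeled corpus.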
Introduction | Instead of utilizing unlabeled examples directly for each given target concept, our approach is to pre-compute a small set of statistics over the unlabeled data in advance. |
Problem Definition | MNB-FM attempts to improve MNB’s estimates of the class-conditional word probabilities θ(w|+) and θ(w|−), using statistics computed over the unlabeled data.
Problem Definition | Equation 2 can be estimated in advance, without knowledge of the target class, simply by counting the number of tokens of each word in the unlabeled data.
Problem Definition | The above example illustrates how MNB-FM can leverage frequency marginal statistics computed over unlabeled data to improve MNB’s conditional probability estimates. |
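The frequency-marginal idea can be sketched in a few lines: a single counting pass over the unlabeled corpus yields the marginal P(w), and the identity P(w) = P(+)P(w|+) + P(−)P(w|−) then lets one correct a conditional estimate. This is an illustrative simplification of MNB-FM, not its actual optimization; the function names are ours:

```python
from collections import Counter

def word_marginals(unlabeled_docs):
    """One counting pass over the unlabeled corpus: P(w) = freq(w) / total tokens."""
    counts = Counter(w for doc in unlabeled_docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def adjust_negative_conditional(p_w_pos, p_pos, p_w_marginal):
    """Use P(w) = P(+)P(w|+) + P(-)P(w|-) to recover a consistent P(w|-)."""
    p_neg = 1.0 - p_pos
    return max(0.0, (p_w_marginal - p_pos * p_w_pos) / p_neg)
```

The key scaling property is that `word_marginals` is computed once, in advance, independently of any target class.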
Abstract | One constructs a nearest-neighbor similarity graph over all trigrams of labeled and unlabeled data for propagating syntactic information, i.e., label distributions. |
Abstract | The derived label distributions are regarded as virtual evidence to regularize the learning of linear conditional random fields (CRFs) on unlabeled data.
Introduction | Therefore, semi-supervised joint S&T appears to be a natural solution for easily incorporating accessible unlabeled data to improve the joint S&T model. |
Introduction | Motivated by the work of Subramanya et al. (2010) and Das and Smith (2011) on structured problems, graph-based label propagation can be employed to propagate valuable syntactic information (n-gram-level label distributions) from labeled data to unlabeled data.
Introduction | labeled and unlabeled data to achieve semi-supervised learning.
Method | The emphasis of this work is on building a joint S&T model based on two different kinds of data sources, labeled and unlabeled data.
Method | In essence, this learning problem can be treated as incorporating certain gainful information from the unlabeled data, e.g., prior knowledge or label constraints, into the supervised model.
Related Work | Sun and Xu (2011) enhanced a CWS model by interpolating statistical features of unlabeled data into the CRF model.
Related Work | (2011) proposed a semi-supervised pipeline S&T model by incorporating n-gram and lexicon features derived from unlabeled data.
Related Work | In contrast to their work, our emphasis is to learn the semi-supervised model by injecting the label information from a similarity graph constructed from labeled and unlabeled data.
Abstract | Similarly to multi-view learning, the “segmentation agreements” between the two different views are used to overcome the scarcity of label information on the unlabeled data.
Abstract | The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data . |
Introduction | Sun and Xu (2011) enhanced the segmentation results by interpolating statistics-based features derived from unlabeled data into a CRF model.
Introduction | The crux of solving the semi-supervised learning problem is the learning on unlabeled data.
Introduction | Inspired by multi-view learning, which exploits redundant views of the same input data (Ganchev et al., 2008), this paper proposes a semi-supervised CWS model that co-regularizes two different views (intrinsically two different models), character-based and word-based, on unlabeled data.
Semi-supervised Learning via Co-regularizing Both Models | As mentioned earlier, the primary challenge of semi-supervised CWS concentrates on the unlabeled data . |
Semi-supervised Learning via Co-regularizing Both Models | Obviously, the learning on unlabeled data does not come for “free”. |
Semi-supervised Learning via Co-regularizing Both Models | Very often, it is necessary to discover certain gainful information, e.g., label constraints on the unlabeled data, that can be incorporated to guide the learner toward a desired solution.
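The "segmentation agreements" used as label constraints can be made concrete: if each view's output is reduced to the set of character-boundary positions it proposes, the agreed boundaries on an unlabeled sentence are simply the intersection. A minimal sketch under that representation (the function names are ours, and the paper's actual regularization objective is richer than a hard intersection):

```python
def boundaries(segmentation):
    """Convert a word segmentation (list of word strings) into the set of
    character-boundary indices it implies."""
    idx, cuts = 0, set()
    for word in segmentation:
        idx += len(word)
        cuts.add(idx)
    return cuts

def agreement_constraints(char_view_seg, word_view_seg):
    """Boundaries that both views propose on the same unlabeled sentence;
    these agreed positions serve as label constraints for co-regularization."""
    return boundaries(char_view_seg) & boundaries(word_view_seg)
```

For example, if the character-based view segments a 4-character string as two 2-character words and the word-based view splits the second word further, only the boundaries they share are kept as constraints.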
Experiments | As for unlabeled data, we crawled the web and collected around 100,000 questions that are similar in style and length to the ones in QuestionBank, e.g.
Experiments | A CRF model is used to decode the unlabeled data to generate more labeled examples for retraining. |
Experiments | smooth the semantic tag posteriors of a unlabeled data decoded by the CRF model using Eq. |
Introduction | To the best of our knowledge, our work is the first to explore the unlabeled data to iteratively adapt the semantic tagging models for target domains, preserving information from the previous iterations. |
Related Work and Motivation | Adapting the source domain using unlabeled data is the key to achieving good performance across domains. |
Semi-Supervised Semantic Labeling | The last term is the loss on unlabeled data from the target domain, weighted by a hyper-parameter.
Semi-Supervised Semantic Labeling | After we decode the unlabeled data, we retrain a new CRF model at each iteration.
Semi-Supervised Semantic Labeling | Each iteration makes predictions on the semantic tags of unlabeled data with varying posterior probabilities. |
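The decode-and-retrain loop of this section can be written generically: at each iteration, decode the unlabeled data, keep only predictions whose posterior probability clears a threshold, and retrain on the augmented set. The `train_fn`/`predict_fn` interface below is our own illustrative stand-in for the paper's CRF trainer and decoder:

```python
def self_train(train_fn, predict_fn, labeled, unlabeled, iters=3, threshold=0.9):
    """Generic self-training: decode unlabeled data, keep high-posterior
    predictions, and retrain a new model at each iteration."""
    model = train_fn(labeled)
    for _ in range(iters):
        confident = []
        for x in unlabeled:
            label, posterior = predict_fn(model, x)
            if posterior >= threshold:          # only trust confident decodes
                confident.append((x, label))
        model = train_fn(labeled + confident)   # retrain on labeled + pseudo-labeled
    return model
```

Plugging in a CRF trainer for `train_fn` and a decoder that returns the Viterbi label with its posterior for `predict_fn` recovers the scheme described above; the threshold reflects the varying posterior probabilities across iterations.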
Abstract | This paper addresses the problem of assembling a collection of labeled training documents, especially annotating negative training documents, and presents a method of text classification from positive and unlabeled data.
Conclusion | The research described in this paper involved text classification using positive and unlabeled data.
Experiments | The rest of the positive and negative documents are used as unlabeled data.
Experiments | The number of positive training data in the other three methods depends on the value of θ, and the rest of the positive and negative documents were used as unlabeled data.
Experiments | Our goal is to achieve classification accuracy from only positive documents and unlabeled data as high as that from labeled positive and negative data. |
Framework of the System | First, we randomly select documents from the unlabeled data (U), where the number of documents is equal to that of the initial positive training documents (P1).
Framework of the System | For the result of correction (R001), we train SVM classifiers and classify the remaining unlabeled data (U \ N1).
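The framework steps above can be sketched end to end: draw |P| random pseudo-negatives N1 from U, train a linear classifier on P versus N1, then classify the remaining U \ N1. To keep the sketch dependency-free, a perceptron stands in for the paper's SVM, and the function name is ours:

```python
import random

def pu_bootstrap(positives, unlabeled, epochs=20, seed=0):
    """PU-learning sketch: random pseudo-negatives, linear training,
    then classification of the remaining unlabeled documents."""
    rng = random.Random(seed)
    pseudo_neg = rng.sample(unlabeled, len(positives))        # N1, |N1| = |P|
    train = [(d, 1) for d in positives] + [(d, -1) for d in pseudo_neg]
    w, b = {}, 0.0                                            # sparse weights
    for _ in range(epochs):                                   # perceptron updates
        for doc, y in train:
            score = b + sum(w.get(f, 0.0) for f in doc)
            if y * score <= 0:
                for f in doc:
                    w[f] = w.get(f, 0.0) + y
                b += y
    remaining = [d for d in unlabeled if d not in pseudo_neg] # U \ N1
    return [(d, 1 if b + sum(w.get(f, 0.0) for f in d) > 0 else -1)
            for d in remaining]
```

In the real system, documents classified as negative would be fed back as additional training data; a library SVM (e.g., a linear-kernel implementation) would replace the perceptron.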
Introduction | Several authors have attempted to improve classification accuracy using only positive and unlabeled data (Yu et al., 2002; Ho et al., 2011). |
Introduction | Our goal is to eliminate the need for manually collecting training documents, and hopefully achieve classification accuracy from positive and unlabeled data as high as that from labeled positive and labeled negative data. |
Introduction | Like much previous work on semi-supervised ML, we apply an SVM to the positive and unlabeled data, and add the classification results to the training data.
Experiment | To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data , which contain 2,420,037 characters. |
Experiment | We also experimented with different sizes of unlabeled data to evaluate the performance when adding unlabeled target-domain data.
Experiment | TABLE 5 shows the f-scores and OOV-Recalls on different unlabeled data sets.
Implementation Details of Multitask Learning Method | BLLIP North American News Text (Complete) is used as the unlabeled data source to generate synthetic labeled data.
Related Work | Due to the lack of benchmark data for implicit discourse relation analysis, earlier work used unlabeled data to generate synthetic implicit discourse data. |
Related Work | Research work in this category exploited both labeled and unlabeled data for discourse relation prediction. |
Related Work | Hernault et al. (2010) presented a semi-supervised method based on the analysis of co-occurring features in labeled and unlabeled data.
Abstract | Starting with a domain-independent, high-precision sentiment lexicon and a large pool of unlabeled data , we bootstrap Twitter-specific sentiment lexicons, using a small amount of labeled data to guide the process. |
Lexicon Bootstrapping | To create a Twitter-specific sentiment lexicon for a given language, we start with a general-purpose, high-precision sentiment lexicon and bootstrap from the unlabeled data (BOOT), using the labeled development data (DEV) to guide the process.
Lexicon Bootstrapping | On each iteration i ≥ 1, tweets in the unlabeled data are labeled using the lexicon.
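The bootstrapping loop can be sketched as follows: label each unlabeled tweet with the current lexicon, then admit new words that co-occur predominantly with one polarity. The expansion heuristic here is an illustrative assumption of ours, not the paper's exact scoring, and a real system would validate candidates against the labeled development data:

```python
from collections import Counter

def bootstrap_lexicon(seed_lexicon, unlabeled_tweets, iters=3, min_count=2):
    """Expand a seed sentiment lexicon (word -> +1/-1) from unlabeled tweets."""
    lex = dict(seed_lexicon)
    for _ in range(iters):
        pos_counts, neg_counts = Counter(), Counter()
        for tweet in unlabeled_tweets:
            score = sum(lex.get(w, 0) for w in tweet)   # label tweet with lexicon
            if score > 0:
                pos_counts.update(w for w in tweet if w not in lex)
            elif score < 0:
                neg_counts.update(w for w in tweet if w not in lex)
        # Admit words seen often enough and skewed toward one polarity.
        for w in list(pos_counts):
            if pos_counts[w] >= min_count and pos_counts[w] > neg_counts[w]:
                lex[w] = 1
        for w in list(neg_counts):
            if neg_counts[w] >= min_count and neg_counts[w] > pos_counts[w]:
                lex[w] = -1
    return lex
```

Because each pass can label tweets that previous passes could not, the lexicon grows across iterations until no new words qualify.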
Related Work | Corpus-based methods extract subjectivity and sentiment lexicons from large amounts of unlabeled data using different similarity metrics to measure the relatedness between words. |
Abstract | Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data . |
Introduction | Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach. |
Related Work | Our method is different in that the (potentially) massive amount of unlabeled data is not required a priori, but only the resultant embedding.