Index of papers in Proc. ACL 2013 that mention
  • unlabeled data
Lucas, Michael and Downey, Doug
Abstract
Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data.
Abstract
However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, meaning the techniques are not applicable to large corpora being produced today.
Abstract
In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets.
Introduction
Semi-supervised Learning (SSL) is a Machine Learning (ML) approach that utilizes large amounts of unlabeled data, combined with a smaller amount of labeled data, to learn a target function (Zhu, 2006; Chapelle et al., 2006).
Introduction
Typically, for each target concept to be learned, a semi-supervised classifier is trained using iterative techniques that execute multiple passes over the unlabeled data (e.g., Expectation-Maximization (Nigam et al., 2000) or Label Propagation (Zhu and Ghahramani, 2002)).
Introduction
Instead of utilizing unlabeled examples directly for each given target concept, our approach is to pre-compute a small set of statistics over the unlabeled data in advance.
Problem Definition
MNB-FM attempts to improve MNB’s estimates of θ_w^+ and θ_w^-, using statistics computed over the unlabeled data.
Problem Definition
Equation 2 can be estimated in advance, without knowledge of the target class, simply by counting the number of tokens of each word in the unlabeled data.
Problem Definition
The above example illustrates how MNB-FM can leverage frequency marginal statistics computed over unlabeled data to improve MNB’s conditional probability estimates.
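For illustration, a minimal sketch of the pre-computed statistic and the marginal constraint it imposes, assuming tokenized documents and a binary (+/-) class problem; `rescale_conditionals` is a simplified stand-in that only illustrates the constraint, not the paper's actual closed-form update:

```python
from collections import Counter

def word_marginals(unlabeled_docs):
    """One pass over the unlabeled corpus: P(w) = count(w) / total tokens.
    Per the paper, this statistic can be pre-computed once, in advance."""
    counts = Counter()
    for doc in unlabeled_docs:          # each doc is a list of tokens
        counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def rescale_conditionals(theta_pos, theta_neg, p_pos, p_w):
    """Toy adjustment: rescale MNB's conditionals so they satisfy the
    marginal constraint P(w) = theta_pos * P(+) + theta_neg * P(-).
    (A uniform rescaling, used here only to make the constraint concrete.)"""
    implied = theta_pos * p_pos + theta_neg * (1.0 - p_pos)
    scale = p_w / implied if implied > 0 else 1.0
    return theta_pos * scale, theta_neg * scale
```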
unlabeled data is mentioned in 25 sentences in this paper.
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
One constructs a nearest-neighbor similarity graph over all trigrams of labeled and unlabeled data for propagating syntactic information, i.e., label distributions.
Abstract
The derived label distributions are regarded as virtual evidences to regularize the learning of linear conditional random fields (CRFs) on unlabeled data.
Introduction
Therefore, semi-supervised joint S&T appears to be a natural solution for easily incorporating accessible unlabeled data to improve the joint S&T model.
Introduction
Motivated by the works in (Subramanya et al., 2010; Das and Smith, 2011), for structured problems, graph-based label propagation can be employed to infer valuable syntactic information (n-gram-level label distributions) from labeled data to unlabeled data.
Introduction
labeled and unlabeled data to achieve semi-supervised learning.
Method
The emphasis of this work is on building a joint S&T model based on two different kinds of data sources, labeled and unlabeled data.
Method
In essence, this learning problem can be treated as incorporating certain gainful information, e.g., prior knowledge or label constraints, of unlabeled data into the supervised model.
Related Work
Sun and Xu (2011) enhanced a CWS model by interpolating statistical features of unlabeled data into the CRFs model.
Related Work
(2011) proposed a semi-supervised pipeline S&T model by incorporating n-gram and lexicon features derived from unlabeled data.
Related Work
Different from their concern, our emphasis is to learn the semi-supervised model by injecting the label information from a similarity graph constructed from labeled and unlabeled data.
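For concreteness, a minimal sketch of graph-based label propagation of the kind referenced above (the classic Zhu and Ghahramani (2002) style update with labeled nodes clamped); the trigram graph construction and the CRF regularization are the paper's own and are not reproduced here:

```python
import numpy as np

def propagate_labels(W, Y, labeled_mask, iters=30):
    """Iterative label propagation over a similarity graph.

    W            -- (n, n) symmetric similarities, e.g. a k-NN graph over trigrams
    Y            -- (n, k) float label distributions; labeled rows are trusted
    labeled_mask -- boolean (n,) marking the labeled nodes
    """
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    Q = Y.copy()
    for _ in range(iters):
        Q = P @ Q                                 # average neighbor distributions
        Q[labeled_mask] = Y[labeled_mask]         # clamp labeled nodes
        Q /= np.maximum(Q.sum(axis=1, keepdims=True), 1e-12)  # renormalize rows
    return Q
```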
unlabeled data is mentioned in 33 sentences in this paper.
Zeng, Xiaodong and Wong, Derek F. and Chao, Lidia S. and Trancoso, Isabel
Abstract
Similarly to multi-view learning, the “segmentation agreements” between the two different types of view are used to overcome the scarcity of the label information on unlabeled data.
Abstract
The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data.
Introduction
Sun and Xu (2011) enhanced the segmentation results by interpolating the statistics-based features derived from unlabeled data into a CRFs model.
Introduction
The crux of solving the semi-supervised learning problem is the learning on unlabeled data.
Introduction
Inspired by multi-view learning that exploits redundant views of the same input data (Ganchev et al., 2008), this paper proposes a semi-supervised CWS model of co-regularizing from two different views (intrinsically two different models), character-based and word-based, on unlabeled data.
Semi-supervised Learning via Co-regularizing Both Models
As mentioned earlier, the primary challenge of semi-supervised CWS concentrates on the unlabeled data.
Semi-supervised Learning via Co-regularizing Both Models
Obviously, the learning on unlabeled data does not come for “free”.
Semi-supervised Learning via Co-regularizing Both Models
Very often, it is necessary to discover certain gainful information, e.g., label constraints of unlabeled data, that is incorporated to guide the learner toward a desired solution.
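A minimal sketch of the agreement idea, assuming two already-trained segmenters that expose a hypothetical `predict` method returning a segmentation as a tuple of words; the paper uses the agreements as constraints for re-training both models, not merely as extra training data:

```python
def segmentation_agreements(unlabeled_sents, char_model, word_model):
    """Collect sentences on which the character-based and word-based
    views produce identical segmentations; these agreements stand in
    for the missing label information on the unlabeled data."""
    agreed = []
    for sent in unlabeled_sents:
        seg_char = char_model.predict(sent)   # e.g. a tuple of words
        seg_word = word_model.predict(sent)
        if seg_char == seg_word:              # the two views agree
            agreed.append((sent, seg_char))
    return agreed
```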
unlabeled data is mentioned in 17 sentences in this paper.
Celikyilmaz, Asli and Hakkani-Tur, Dilek and Tur, Gokhan and Sarikaya, Ruhi
Experiments
As for unlabeled data, we crawled the web and collected around 100,000 questions that are similar in style and length to the ones in QuestionBank.
Experiments
A CRF model is used to decode the unlabeled data to generate more labeled examples for retraining.
Experiments
smooth the semantic tag posteriors of the unlabeled data decoded by the CRF model using Eq.
Introduction
To the best of our knowledge, our work is the first to explore the unlabeled data to iteratively adapt the semantic tagging models for target domains, preserving information from the previous iterations.
Related Work and Motivation
Adapting the source domain using unlabeled data is the key to achieving good performance across domains.
Semi-Supervised Semantic Labeling
The last term is the loss on unlabeled data from the target domain, with a hyper-parameter τ.
Semi-Supervised Semantic Labeling
After we decode the unlabeled data, we retrain a new CRF model at each iteration.
Semi-Supervised Semantic Labeling
Each iteration makes predictions on the semantic tags of unlabeled data with varying posterior probabilities.
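Putting the snippets above together, a sketch of the iterative decode-and-retrain loop, with hypothetical `crf_train`/`crf_decode` callables and a plain posterior threshold standing in for the paper's smoothed posteriors:

```python
def self_train(crf_train, crf_decode, labeled, unlabeled, rounds=5, threshold=0.9):
    """crf_train(examples) -> model; crf_decode(model, x) -> (tags, posterior).
    Each round decodes the unlabeled data and retrains on the predictions
    whose posterior clears the confidence threshold."""
    model = crf_train(labeled)
    for _ in range(rounds):
        confident = []
        for x in unlabeled:
            tags, posterior = crf_decode(model, x)
            if posterior >= threshold:           # keep confident predictions
                confident.append((x, tags))
        model = crf_train(labeled + confident)   # retrain at each iteration
    return model
```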
unlabeled data is mentioned in 10 sentences in this paper.
Fukumoto, Fumiyo and Suzuki, Yoshimi and Matsuyoshi, Suguru
Abstract
This paper addresses the problem of dealing with a collection of labeled training documents, especially annotating negative training documents, and presents a method of text classification from positive and unlabeled data.
Conclusion
The research described in this paper involved text classification using positive and unlabeled data.
Experiments
The rest of the positive and negative documents are used as unlabeled data.
Experiments
The number of positive training data in the other three methods depends on the value of δ, and the rest of the positive and negative documents were used as unlabeled data.
Experiments
Our goal is to achieve classification accuracy from only positive documents and unlabeled data as high as that from labeled positive and negative data.
Framework of the System
First, we randomly select documents from unlabeled data (U) where the number of documents is equal to that of the initial positive training documents (P1).
Framework of the System
For the result of correction (Rcor), we train SVM classifiers, and classify the remaining unlabeled data (U \ N1).
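The two framework snippets above translate almost directly into code; a sketch of the first round, assuming an injected `train_svm(positives, negatives)` callable that returns a classifier with a `predict` method:

```python
import random

def pu_first_round(P1, U, train_svm):
    """Draw |P1| pseudo-negatives at random from the unlabeled pool U,
    train an SVM on P1 vs. the sample, and classify the rest of U."""
    N1 = random.sample(U, k=len(P1))                    # pseudo-negatives from U
    clf = train_svm(P1, N1)
    sampled = set(map(id, N1))
    remaining = [d for d in U if id(d) not in sampled]  # U \ N1
    return [(d, clf.predict(d)) for d in remaining]
```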
Introduction
Several authors have attempted to improve classification accuracy using only positive and unlabeled data (Yu et al., 2002; Ho et al., 2011).
Introduction
Our goal is to eliminate the need for manually collecting training documents, and hopefully achieve classification accuracy from positive and unlabeled data as high as that from labeled positive and labeled negative data.
Introduction
Like much previous work on semi-supervised ML, we apply SVM to the positive and unlabeled data, and add the classification results to the training data.
unlabeled data is mentioned in 10 sentences in this paper.
Zhang, Longkai and Li, Li and He, Zhengyan and Wang, Houfeng and Sun, Ni
Experiment
To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data, which contain 2,420,037 characters.
Experiment
We also experimented on different sizes of unlabeled data to evaluate the performance when adding unlabeled target domain data.
Experiment
TABLE 5 shows the F-scores and OOV-Recalls on different unlabeled data sets.
unlabeled data is mentioned in 5 sentences in this paper.
Lan, Man and Xu, Yu and Niu, Zhengyu
Implementation Details of Multitask Learning Method
BLLIP North American News Text (Complete) is used as the unlabeled data source to generate synthetic labeled data.
Related Work
Due to the lack of benchmark data for implicit discourse relation analysis, earlier work used unlabeled data to generate synthetic implicit discourse data.
Related Work
Research work in this category exploited both labeled and unlabeled data for discourse relation prediction.
Related Work
Hernault et al. (2010) presented a semi-supervised method based on the analysis of co-occurring features in labeled and unlabeled data.
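The synthetic-data trick summarized above is usually the connective-deletion recipe popularized by Marcu and Echihabi (2002): find argument pairs joined by an explicit connective in unlabeled text, delete the connective, and keep the relation it signals as the label. A sketch with an illustrative (not exhaustive) connective inventory:

```python
# Illustrative connective-to-relation map; real inventories are much larger.
CONNECTIVE_RELATION = {
    "because": "Cause",
    "but": "Contrast",
    "however": "Contrast",
}

def synthesize_implicit(pairs):
    """pairs: (arg1, arg2) sentence/clause pairs from unlabeled text.
    If arg2 starts with a known connective, drop it and emit a synthetic
    'implicit' example labeled with the relation the connective signals."""
    examples = []
    for arg1, arg2 in pairs:
        head, _, rest = arg2.partition(" ")
        relation = CONNECTIVE_RELATION.get(head.lower().strip(","))
        if relation and rest:
            examples.append((arg1, rest, relation))   # connective removed
    return examples
```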
unlabeled data is mentioned in 4 sentences in this paper.
Volkova, Svitlana and Wilson, Theresa and Yarowsky, David
Abstract
Starting with a domain-independent, high-precision sentiment lexicon and a large pool of unlabeled data, we bootstrap Twitter-specific sentiment lexicons, using a small amount of labeled data to guide the process.
Lexicon Bootstrapping
To create a Twitter-specific sentiment lexicon for a given language, we start with a general-purpose, high-precision sentiment lexicon and bootstrap from the unlabeled data (BOOT) using the labeled development data (DEV) to guide the process.
Lexicon Bootstrapping
On each iteration i ≥ 1, tweets in the unlabeled data are labeled using the lexicon
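A toy version of such a bootstrapping loop, assuming tokenized tweets and a seed lexicon mapping words to "pos"/"neg"; the DEV-guided stopping criterion from the paper is replaced here by a simple convergence check:

```python
from collections import Counter, defaultdict

def vote(tweet, lexicon):
    """Label a tweet by majority vote of the lexicon words it contains."""
    score = sum(1 if lexicon[w] == "pos" else -1 for w in tweet if w in lexicon)
    return "pos" if score > 0 else "neg" if score < 0 else None

def bootstrap_lexicon(seed, unlabeled_tweets, iters=10, min_count=5, purity=0.9):
    """Each iteration labels the tweets with the current lexicon, then
    promotes words that occur almost exclusively under one polarity."""
    lexicon = dict(seed)
    for _ in range(iters):
        by_word = defaultdict(Counter)
        for tweet in unlabeled_tweets:
            label = vote(tweet, lexicon)
            if label:
                for w in set(tweet):
                    by_word[w][label] += 1
        added = 0
        for w, counts in by_word.items():
            label, n = counts.most_common(1)[0]
            if w not in lexicon and n >= min_count and n / sum(counts.values()) >= purity:
                lexicon[w] = label               # promote a new polar word
                added += 1
        if added == 0:                           # no new words: converged
            break
    return lexicon
```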
Related Work
Corpus-based methods extract subjectivity and sentiment lexicons from large amounts of unlabeled data using different similarity metrics to measure the relatedness between words.
unlabeled data is mentioned in 4 sentences in this paper.
Labutov, Igor and Lipson, Hod
Abstract
Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data.
Introduction
Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach.
Related Work
Our method is different in that the (potentially) massive amount of unlabeled data is not required a priori, but only the resultant embedding.
unlabeled data is mentioned in 3 sentences in this paper.