Abstract | Semi-supervised learning (SSL) methods augment standard machine learning (ML) techniques to leverage unlabeled data.
Abstract | However, existing SSL techniques typically require multiple passes over the entirety of the unlabeled data, meaning the techniques are not applicable to the large corpora being produced today.
Abstract | In this paper, we show that improving marginal word frequency estimates using unlabeled data can enable semi-supervised text classification that scales to massive unlabeled data sets. |
Introduction | Semi-supervised learning (SSL) is a machine learning (ML) approach that utilizes large amounts of unlabeled data, combined with a smaller amount of labeled data, to learn a target function (Zhu, 2006; Chapelle et al., 2006).
Introduction | Typically, for each target concept to be learned, a semi-supervised classifier is trained using iterative techniques that execute multiple passes over the unlabeled data (e.g., Expectation-Maximization (Nigam et al., 2000) or Label Propagation (Zhu and Ghahramani, 2002)).
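The iterative scheme described above can be illustrated with a minimal EM loop for a multinomial Naive Bayes text classifier. This is a dependency-free sketch; the function names and toy data are our own, not from any cited paper. Note how every iteration re-reads the entire unlabeled set, which is exactly the scaling bottleneck these sentences point at:

```python
import math
from collections import defaultdict

def fit_mnb(examples, alpha=1.0):
    """Fit multinomial Naive Bayes from soft-labeled docs: (words, [p0, p1])."""
    vocab = {w for words, _ in examples for w in words}
    prior = [alpha, alpha]
    counts = [defaultdict(float), defaultdict(float)]
    for words, probs in examples:
        for c in (0, 1):
            prior[c] += probs[c]
            for w in words:
                counts[c][w] += probs[c]
    totals = [sum(counts[c].values()) for c in (0, 1)]

    def posterior(words):
        # Log-domain scoring with Laplace smoothing, then normalize.
        scores = []
        for c in (0, 1):
            s = math.log(prior[c])
            for w in words:
                s += math.log((counts[c][w] + alpha) / (totals[c] + alpha * len(vocab)))
            scores.append(s)
        m = max(scores)
        z = sum(math.exp(s - m) for s in scores)
        return [math.exp(s - m) / z for s in scores]

    return posterior

def em_ssl(labeled, unlabeled, iters=5):
    """EM: the E-step scores ALL unlabeled docs, the M-step retrains on soft counts."""
    hard = [(w, [1.0 if c == y else 0.0 for c in (0, 1)]) for w, y in labeled]
    posterior = fit_mnb(hard)
    for _ in range(iters):            # each iteration is a full pass over unlabeled
        soft = [(w, posterior(w)) for w in unlabeled]
        posterior = fit_mnb(hard + soft)
    return posterior
```

Swapping `fit_mnb` for any probabilistic learner gives the general pattern; the cost is `iters` full scans of the unlabeled corpus.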
Introduction | Instead of utilizing unlabeled examples directly for each given target concept, our approach is to pre-compute a small set of statistics over the unlabeled data in advance. |
Problem Definition | MNB-FM attempts to improve MNB’s estimates of the class-conditional word probabilities θ(w|+) and θ(w|−), using statistics computed over the unlabeled data.
Problem Definition | Equation 2 can be estimated in advance, without knowledge of the target class, simply by counting the number of tokens of each word in the unlabeled data.
Problem Definition | The above example illustrates how MNB-FM can leverage frequency marginal statistics computed over unlabeled data to improve MNB’s conditional probability estimates. |
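The frequency-marginal idea can be sketched in a few lines: a single counting pass over the unlabeled corpus yields the marginal P(w), and the identity P(w) = P(+)P(w|+) + P(−)P(w|−) then lets one correct a conditional estimate. This is an illustrative simplification of MNB-FM, not its actual optimization; the function names are ours:

```python
from collections import Counter

def word_marginals(unlabeled_docs):
    """One counting pass over the unlabeled corpus: P(w) = freq(w) / total tokens."""
    counts = Counter(w for doc in unlabeled_docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def adjust_negative_conditional(p_w_pos, p_pos, p_w_marginal):
    """Use P(w) = P(+)P(w|+) + P(-)P(w|-) to recover a consistent P(w|-)."""
    p_neg = 1.0 - p_pos
    return max(0.0, (p_w_marginal - p_pos * p_w_pos) / p_neg)
```

The key scaling property is that `word_marginals` is computed once, in advance, independently of any target class.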
Abstract | One constructs a nearest-neighbor similarity graph over all trigrams of labeled and unlabeled data for propagating syntactic information, i.e., label distributions. |
Abstract | The derived label distributions are regarded as virtual evidence to regularize the learning of linear conditional random fields (CRFs) on unlabeled data.
Introduction | Therefore, semi-supervised joint S&T appears to be a natural solution for easily incorporating accessible unlabeled data to improve the joint S&T model. |
Introduction | Motivated by the work of Subramanya et al. (2010) and Das and Smith (2011) on structured problems, graph-based label propagation can be employed to propagate valuable syntactic information (n-gram-level label distributions) from labeled data to unlabeled data.
Introduction | labeled and unlabeled data to achieve semi-supervised learning.
Method | The emphasis of this work is on building a joint S&T model based on two different kinds of data sources, labeled and unlabeled data.
Method | In essence, this learning problem can be treated as incorporating certain gainful information from the unlabeled data, e.g., prior knowledge or label constraints, into the supervised model.
Related Work | Sun and Xu (2011) enhanced a CWS model by interpolating statistical features of unlabeled data into the CRF model.
Related Work | (2011) proposed a semi-supervised pipeline S&T model by incorporating n-gram and lexicon features derived from unlabeled data.
Related Work | In contrast to their work, our emphasis is to learn the semi-supervised model by injecting the label information from a similarity graph constructed from labeled and unlabeled data.
Abstract | Similarly to multi-view learning, the “segmentation agreements” between the two different views are used to overcome the scarcity of label information on the unlabeled data.
Abstract | The agreements are regarded as a set of valuable constraints for regularizing the learning of both models on unlabeled data . |
Introduction | Sun and Xu (2011) enhanced the segmentation results by interpolating statistics-based features derived from unlabeled data into a CRF model.
Introduction | The crux of solving the semi-supervised learning problem is the learning on unlabeled data.
Introduction | Inspired by multi-view learning, which exploits redundant views of the same input data (Ganchev et al., 2008), this paper proposes a semi-supervised CWS model that co-regularizes two different views (intrinsically two different models), character-based and word-based, on unlabeled data.
Semi-supervised Learning via Co-regularizing Both Models | As mentioned earlier, the primary challenge of semi-supervised CWS concentrates on the unlabeled data . |
Semi-supervised Learning via Co-regularizing Both Models | Obviously, the learning on unlabeled data does not come for “free”. |
Semi-supervised Learning via Co-regularizing Both Models | Very often, it is necessary to discover certain gainful information, e.g., label constraints on the unlabeled data, that can be incorporated to guide the learner toward a desired solution.
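The "segmentation agreements" used as label constraints can be made concrete: if each view's output is reduced to the set of character-boundary positions it proposes, the agreed boundaries on an unlabeled sentence are simply the intersection. A minimal sketch under that representation (the function names are ours, and the paper's actual regularization objective is richer than a hard intersection):

```python
def boundaries(segmentation):
    """Convert a word segmentation (list of word strings) into the set of
    character-boundary indices it implies."""
    idx, cuts = 0, set()
    for word in segmentation:
        idx += len(word)
        cuts.add(idx)
    return cuts

def agreement_constraints(char_view_seg, word_view_seg):
    """Boundaries that both views propose on the same unlabeled sentence;
    these agreed positions serve as label constraints for co-regularization."""
    return boundaries(char_view_seg) & boundaries(word_view_seg)
```

For example, if the character-based view segments a 4-character string as two 2-character words and the word-based view splits the second word further, only the boundaries they share are kept as constraints.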
Experiments | As for unlabeled data, we crawled the web and collected around 100,000 questions that are similar in style and length to the ones in QuestionBank, e.g.
Experiments | A CRF model is used to decode the unlabeled data to generate more labeled examples for retraining. |
Experiments | smooth the semantic tag posteriors of a unlabeled data decoded by the CRF model using Eq. |
Introduction | To the best of our knowledge, our work is the first to explore the unlabeled data to iteratively adapt the semantic tagging models for target domains, preserving information from the previous iterations. |
Related Work and Motivation | Adapting the source domain using unlabeled data is the key to achieving good performance across domains. |
Semi-Supervised Semantic Labeling | The last term is the loss on unlabeled data from the target domain, weighted by a hyper-parameter.
Semi-Supervised Semantic Labeling | After we decode the unlabeled data, we retrain a new CRF model at each iteration.
Semi-Supervised Semantic Labeling | Each iteration makes predictions on the semantic tags of unlabeled data with varying posterior probabilities. |
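The decode-and-retrain loop of this section can be written generically: at each iteration, decode the unlabeled data, keep only predictions whose posterior probability clears a threshold, and retrain on the augmented set. The `train_fn`/`predict_fn` interface below is our own illustrative stand-in for the paper's CRF trainer and decoder:

```python
def self_train(train_fn, predict_fn, labeled, unlabeled, iters=3, threshold=0.9):
    """Generic self-training: decode unlabeled data, keep high-posterior
    predictions, and retrain a new model at each iteration."""
    model = train_fn(labeled)
    for _ in range(iters):
        confident = []
        for x in unlabeled:
            label, posterior = predict_fn(model, x)
            if posterior >= threshold:          # only trust confident decodes
                confident.append((x, label))
        model = train_fn(labeled + confident)   # retrain on labeled + pseudo-labeled
    return model
```

Plugging in a CRF trainer for `train_fn` and a decoder that returns the Viterbi label with its posterior for `predict_fn` recovers the scheme described above; the threshold reflects the varying posterior probabilities across iterations.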
Abstract | This paper addresses the problem of assembling a collection of labeled training documents, especially annotating negative training documents, and presents a method of text classification from positive and unlabeled data.
Conclusion | The research described in this paper involved text classification using positive and unlabeled data.
Experiments | The rest of the positive and negative documents are used as unlabeled data.
Experiments | The number of positive training data in the other three methods depends on the value of θ, and the rest of the positive and negative documents were used as unlabeled data.
Experiments | Our goal is to achieve classification accuracy from only positive documents and unlabeled data as high as that from labeled positive and negative data. |
Framework of the System | First, we randomly select documents from the unlabeled data (U), where the number of documents is equal to that of the initial positive training documents (P1).
Framework of the System | For the result of correction (R001), we train SVM classifiers and classify the remaining unlabeled data (U \ N1).
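The framework steps above can be sketched end to end: draw |P| random pseudo-negatives N1 from U, train a linear classifier on P versus N1, then classify the remaining U \ N1. To keep the sketch dependency-free, a perceptron stands in for the paper's SVM, and the function name is ours:

```python
import random

def pu_bootstrap(positives, unlabeled, epochs=20, seed=0):
    """PU-learning sketch: random pseudo-negatives, linear training,
    then classification of the remaining unlabeled documents."""
    rng = random.Random(seed)
    pseudo_neg = rng.sample(unlabeled, len(positives))        # N1, |N1| = |P|
    train = [(d, 1) for d in positives] + [(d, -1) for d in pseudo_neg]
    w, b = {}, 0.0                                            # sparse weights
    for _ in range(epochs):                                   # perceptron updates
        for doc, y in train:
            score = b + sum(w.get(f, 0.0) for f in doc)
            if y * score <= 0:
                for f in doc:
                    w[f] = w.get(f, 0.0) + y
                b += y
    remaining = [d for d in unlabeled if d not in pseudo_neg] # U \ N1
    return [(d, 1 if b + sum(w.get(f, 0.0) for f in d) > 0 else -1)
            for d in remaining]
```

In the real system, documents classified as negative would be fed back as additional training data; a library SVM (e.g., a linear-kernel implementation) would replace the perceptron.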
Introduction | Several authors have attempted to improve classification accuracy using only positive and unlabeled data (Yu et al., 2002; Ho et al., 2011). |
Introduction | Our goal is to eliminate the need for manually collecting training documents, and hopefully achieve classification accuracy from positive and unlabeled data as high as that from labeled positive and labeled negative data. |
Introduction | Like much previous work on semi-supervised ML, we apply an SVM to the positive and unlabeled data, and add the classification results to the training data.
Experiment | To keep the experiment tractable, we first randomly choose 50,000 of all the texts as unlabeled data , which contain 2,420,037 characters. |
Experiment | We also experimented with different sizes of unlabeled data to evaluate the performance when adding unlabeled target-domain data.
Experiment | TABLE 5 shows the f-scores and OOV-Recalls on different unlabeled data sets.
Implementation Details of Multitask Learning Method | BLLIP North American News Text (Complete) is used as the unlabeled data source to generate synthetic labeled data.
Related Work | Due to the lack of benchmark data for implicit discourse relation analysis, earlier work used unlabeled data to generate synthetic implicit discourse data. |
Related Work | Research work in this category exploited both labeled and unlabeled data for discourse relation prediction. |
Related Work | Hernault et al. (2010) presented a semi-supervised method based on the analysis of co-occurring features in labeled and unlabeled data.
Abstract | Starting with a domain-independent, high-precision sentiment lexicon and a large pool of unlabeled data , we bootstrap Twitter-specific sentiment lexicons, using a small amount of labeled data to guide the process. |
Lexicon Bootstrapping | To create a Twitter-specific sentiment lexicon for a given language, we start with a general-purpose, high-precision sentiment lexicon and bootstrap from the unlabeled data (BOOT), using the labeled development data (DEV) to guide the process.
Lexicon Bootstrapping | On each iteration i ≥ 1, tweets in the unlabeled data are labeled using the lexicon.
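The bootstrapping loop can be sketched as follows: label each unlabeled tweet with the current lexicon, then admit new words that co-occur predominantly with one polarity. The expansion heuristic here is an illustrative assumption of ours, not the paper's exact scoring, and a real system would validate candidates against the labeled development data:

```python
from collections import Counter

def bootstrap_lexicon(seed_lexicon, unlabeled_tweets, iters=3, min_count=2):
    """Expand a seed sentiment lexicon (word -> +1/-1) from unlabeled tweets."""
    lex = dict(seed_lexicon)
    for _ in range(iters):
        pos_counts, neg_counts = Counter(), Counter()
        for tweet in unlabeled_tweets:
            score = sum(lex.get(w, 0) for w in tweet)   # label tweet with lexicon
            if score > 0:
                pos_counts.update(w for w in tweet if w not in lex)
            elif score < 0:
                neg_counts.update(w for w in tweet if w not in lex)
        # Admit words seen often enough and skewed toward one polarity.
        for w in list(pos_counts):
            if pos_counts[w] >= min_count and pos_counts[w] > neg_counts[w]:
                lex[w] = 1
        for w in list(neg_counts):
            if neg_counts[w] >= min_count and neg_counts[w] > pos_counts[w]:
                lex[w] = -1
    return lex
```

Because each pass can label tweets that previous passes could not, the lexicon grows across iterations until no new words qualify.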
Related Work | Corpus-based methods extract subjectivity and sentiment lexicons from large amounts of unlabeled data using different similarity metrics to measure the relatedness between words. |
Abstract | Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data . |
Introduction | Traditionally, we would learn the embeddings for the target task jointly with whatever unlabeled data we may have, in an instance of semi-supervised learning, and/or we may leverage labels from multiple other related tasks in a multitask approach. |
Related Work | Our method is different in that the (potentially) massive amount of unlabeled data is not required a priori, but only the resultant embedding.