A grammar for semantic tagging | For these reasons, we evaluate our grammar model on the task of automatic tagging of queries for which we have labeled data available. |
A grammar for semantic tagging | The model, however, extends the lexicon by including words discovered from labeled data (if available). |
A grammar for semantic tagging | It is true that the word “vs” plays a critical role in this query, indicating that the user’s intention is to compare the two brands; but, as mentioned above, in our labeled data such words have been left unlabeled. |
Discriminative re-ranking | In particular, when there is no labeled data or only a very small amount of it, a parser could still work by using unsupervised learning approaches to learn the rules, or by simply using a set of hand-built rules (as we did above for the task of semantic tagging). |
Discriminative re-ranking | When there is enough labeled data, then a discriminative model can be trained on the labeled data to learn contextual information and to further enhance the tagging performance. |
Introduction | Preparing labeled data, however, is very expensive. |
Introduction | Therefore in cases where there is no or a small amount of labeled data available, these models do a poor job. |
Introduction | As seen later, in the case where there is not a large amount of labeled data available, the parser part is the dominant part of the module and performs reasonably well. |
Summary | This is a big advantage of the parser model, because in practice providing labeled data is very expensive but very often the lexicons can be easily extracted from the structured data on the web (for example extracting movie titles from imdb or book titles from Amazon). |
Abstract | Chinese) using labeled data in the source language (e.g. |
Abstract | Most existing work relies on machine translation engines to directly adapt labeled data from the source language to the target language. |
Abstract | Experiments on multiple data sets show that CLMM is consistently effective in two settings: (1) labeled data in the target language are unavailable; and (2) labeled data in the target language are also available. |
Introduction | Therefore, it is desirable to use the English labeled data to improve sentiment classification of documents in other languages. |
Introduction | One direct approach to leveraging the labeled data in English is to use machine translation engines as a black box to translate the labeled data from English to the target language (e.g. |
Introduction | First, the vocabulary covered by the translated labeled data is limited, hence many sentiment-indicative words cannot be learned from the translated labeled data. |
Abstract | In this paper, we propose a domain adaptation framework for sentiment- and topic- lexicon co-extraction in a domain of interest where we do not require any labeled data, but have lots of labeled data in another related domain. |
Introduction | In this paper, we focus on the co-extraction task of sentiment and topic lexicons in a target domain where we do not have any labeled data, but have plenty of labeled data in a source domain. |
Introduction | lize useful labeled data from the source domain as well as exploit the relationships between the topic and sentiment words to propagate information for lexicon construction in the target domain. |
Introduction | labeled data be available in the target domain. |
Abstract | We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. |
Experiments | Figure 3: Effect of source domain labeled data . |
Experiments | To investigate the impact of the quantity of source domain labeled data on our method, we vary the amount of data from zero to 800 reviews, with equal amounts of positive and negative labeled data . |
Experiments | Note that source domain labeled data is used both to create the sentiment sensitive thesaurus as well as to train the sentiment classifier. |
Introduction | Supervised learning algorithms that require labeled data have been successfully used to build sentiment classifiers for a specific domain (Pang et al., 2002). |
Introduction | positive or negative sentiment) given a small set of labeled data for the source domain, and unlabeled data for both source and target domains. |
Introduction | In particular, no labeled data is provided for the target domain. |
Introduction | It is well known that sentiment classification is very domain-specific (Blitzer et al., 2007), so it is critical to eliminate its dependence on large-scale labeled data if it is to be widely applicable. |
Related Work | Supervised methods consider sentiment classification as a standard classification problem in which labeled data in a domain are used to train a domain-specific classifier. |
Unsupervised Mining of Personal and Impersonal Views | The co-training algorithm is a specific semi-supervised learning approach which starts with a set of labeled data and increases the amount of labeled data using the unlabeled data by bootstrapping (Blum and Mitchell, 1998). |
Unsupervised Mining of Personal and Impersonal Views | Input: The labeled data L containing personal sentence set S_{L-personal} and impersonal sentence set |
Unsupervised Mining of Personal and Impersonal Views | S_{U-personal} S_{U-impersonal} Output: New labeled data L Procedure: |
Abstract | To improve the performance of a cause identification system for the minority classes, we present a bootstrapping algorithm that automatically augments a training set by learning from a small amount of labeled data and a large amount of unlabeled data. |
Abstract | Experimental results show that our algorithm yields a relative error reduction of 6.3% in F-measure for the minority classes in comparison to a baseline that learns solely from the labeled data. |
Baseline Approaches | mate goal is to evaluate the effectiveness of our bootstrapping algorithm, the baseline approaches only make use of small amounts of labeled data for acquiring classifiers. |
Baseline Approaches | To ensure a fair comparison with the first baseline, we do not employ additional labeled data for parameter tuning; rather, we reserve 25% of the available training data for tuning, and use the remaining 75% for classifier |
Introduction | The difficulty of a text classification task depends on various factors, but typically, the task can be difficult if (1) the amount of labeled data available for learning the task is small; (2) it involves multiple classes; (3) it involves multi-label categorization, where more than one label can be assigned to each document; (4) the class distributions are skewed, with some categories significantly outnumbering the others; and (5) the documents belong to the same domain (e.g., movie review classification). |
Introduction | Such methods, however, are unlikely to perform equally well for our cause identification task given our small labeled set, as the minority class prediction problem is complicated by the scarcity of labeled data. |
Introduction | More specifically, given the scarcity of labeled data , many words that are potentially correlated with a shaper (especially a minority shaper) may not appear in the training set, and the lack of such useful indicators could hamper the acquisition of an accurate classifier via supervised learning techniques. |
Our Bootstrapping Algorithm | One of the potential weaknesses of the two baselines described in the previous section is that the classifiers are trained on only a small amount of labeled data . |
Our Bootstrapping Algorithm | The situation is somewhat aggravated by the fact that we are adopting a one-versus-all scheme for generating training instances for a particular shaper, which, together with the small amount of labeled data , implies that only a couple of positive instances may be available for training the classifier for a minority class. |
Our Bootstrapping Algorithm | The reason we impose the “at least three” requirement is precision: we want to ensure, with a reasonable level of confidence, that the unlabeled documents chosen to augment P should indeed be labeled with the shaper under consideration, as incorrectly labeled documents would contaminate the labeled data, thus accelerating the deterioration of the quality of the automatically labeled data in subsequent bootstrapping iterations and adversely affecting the accuracy of the classifier trained on it (Pierce and Cardie, 2001). |
A Joint Model with Unlabeled Parallel Text | where $v \in \{1, 2\}$ denotes L1 or L2; the first term on the right-hand side is the likelihood of labeled data for both D1 and D2; and the second term is the likelihood of the unlabeled parallel data U. |
A Joint Model with Unlabeled Parallel Text | By further considering the weight to ascribe to the unlabeled data vs. the labeled data (and the weight for the L2-norm regularization), we get the following regularized joint log likelihood to be maximized: |
A Joint Model with Unlabeled Parallel Text | where the first term on the right-hand side is the log likelihood of the labeled data from both D1 and D2; the second is the log likelihood of the unlabeled parallel data U, multiplied by $\lambda_1 \geq 0$, a constant that controls the contribution of the unlabeled data; and $\lambda_2 \geq 0$ is a regularization constant that penalizes model complexity or large feature weights. |
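The three terms described in this excerpt can be written out as a single objective. The following is only a sketch of its general shape under stated assumptions: the parameter vectors $\theta_1, \theta_2$ and the likelihood notation $P(\cdot\,;\theta)$ are placeholder symbols introduced here for illustration, and the squared L2 penalty is one common reading of "L2-norm regularization", not a form confirmed by the excerpt.

```latex
% Assumed notation: \theta_v are the parameters of the classifier for language v.
% D_v is the labeled data for language v, U is the unlabeled parallel data.
\max_{\theta_1, \theta_2} \;
  \sum_{v \in \{1, 2\}} \log P(D_v ; \theta_v)          % labeled-data log likelihood
  \;+\; \lambda_1 \, \log P(U ; \theta_1, \theta_2)     % weighted unlabeled parallel data
  \;-\; \lambda_2 \sum_{v \in \{1, 2\}} \lVert \theta_v \rVert_2^2   % regularization
\qquad \lambda_1 \geq 0,\; \lambda_2 \geq 0
```

Setting $\lambda_1 = 0$ in this sketch recovers two independently regularized monolingual objectives, which is why $\lambda_1$ can be read as the knob controlling how much the unlabeled parallel text contributes.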
Abstract | We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data. |
Introduction | Given the labeled data in each language, we propose an approach that exploits an unlabeled parallel corpus with the following |
Introduction | The proposed maximum entropy-based EM approach jointly learns two monolingual sentiment classifiers by treating the sentiment labels in the unlabeled parallel text as unobserved latent variables, and maximizes the regularized joint likelihood of the language-specific labeled data together with the inferred sentiment labels of the parallel text. |
Abstract | In addition, it is challenging to generate sufficient high quality labeled data for supervised models with low cost. |
Abstract | Compared to the state-of-the-art supervised model trained from 100% labeled data, our proposed approach achieves comparable performance with 31% labeled data and obtains 5% absolute F1 gain with 50% labeled data. |
Conclusions | By studying three novel fine-grained relations, detecting semantically-related information with semantic meta paths, and exploiting the data manifolds in both unlabeled and labeled data for collective inference, our work can dramatically save annotation cost and achieve better performance, thus shedding light on the challenging wikification task for tweets. |
Experiments | In comparison with the supervised baseline proposed by Meij et al. (2012), our model SSRega1 relying on local compatibility already achieves comparable performance with 50% of labeled data. |
Experiments | We can easily see that our proposed approach using 50% labeled data achieves similar performance to the state-of-the-art supervised model with 100% labeled data. |
Experiments | 6.4 Effect of Labeled Data Size |
Introduction | Sufficient labeled data is crucial for supervised models. |
Introduction | In order to address these unique challenges for wikification for the short tweets, we employ graph-based semi-supervised learning algorithms (Zhu et al., 2003; Smola and Kondor, 2003; Blum et al., 2004; Zhou et al., 2004; Talukdar and Crammer, 2009) for collective inference by exploiting the manifold (cluster) structure in both unlabeled and labeled data . |
Abstract | We address two challenges: negative transfer when knowledge in source domains is used without considering the differences in relation distributions; and lack of adequate labeled samples for rarer relations in the new domain, due to a small labeled data set and imbalanced relation distributions. |
Introduction | However, most supervised learning algorithms require adequate labeled data for every relation type to be extracted. |
Introduction | Instead, it can be more cost-effective to adapt an existing relation extraction system to the new domain using a small set of labeled data . |
Introduction | Together with imbalanced relation distributions inherent in the domain, this can cause some rarer relations to constitute only a very small proportion of the labeled data set. |
Problem Statement | The target domain has a few labeled data $D_l = \{(x_i, y_i)\}$. |
Problem Statement | For the $s$th source domain, we have an adequate labeled data set $D_s$. |
Related Work | However, purely supervised relation extraction methods assume the availability of sufficient labeled data , which may be costly to obtain for new domains. |
Related Work | We address this by augmenting a small labeled data set with other information in the domain adaptation setting. |
Related Work | To create labeled data , the texts are dependency-parsed, and the domain-independent patterns on the parses form the basis for extractions. |
Robust Domain Adaptation | By augmenting with unlabeled data $D_u$, we aim to alleviate the effect of imbalanced relation distribution, which causes a lack of labeled samples for rarer classes in a small set of labeled data. |
Evaluation | Owing to the randomness involved in the choice of labeled data , all baseline results are averaged over ten independent runs for each fold. |
Evaluation | We implemented Kamvar et al.’s (2003) semi-supervised spectral clustering algorithm, which incorporates labeled data into the clustering framework in the form of must-link and cannot-link constraints. |
Evaluation | We employ as our second baseline a transductive SVM trained using 100 points randomly sampled from the training folds as labeled data and the remaining 1900 points as unlabeled data. |
Introduction | perimental results on five sentiment classification datasets demonstrate that our system can generate high-quality labeled data from unambiguous reviews, which, together with a small number of manually labeled reviews selected by the active learner, can be used to effectively classify ambiguous reviews in a discriminative fashion. |
Our Approach | However, in the absence of labeled data, it is not easy to assess feature relevance. |
Our Approach | Even if labeled data were present, the ambiguous points might be better handled by a discriminative learning system than a clustering algorithm, as discriminative learners are more sophisticated, and can handle ambiguous feature space more effectively. |
Our Approach | In self-training, we iteratively train a classifier on the data labeled so far, use it to classify the unlabeled instances, and augment the labeled data with the most confidently labeled instances. |
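The self-training loop described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' system: the nearest-centroid classifier on 1-D points, the margin-based confidence, and the names `self_train`, `per_round`, and `rounds` are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def train_centroids(labeled):
    """Fit a toy classifier: one centroid per label (stand-in for any real learner)."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict_with_confidence(centroids, x):
    """Return (label, margin); the margin is how much closer x is to the
    nearest centroid than to the runner-up, used here as a confidence score."""
    ranked = sorted(centroids.items(), key=lambda kv: abs(x - kv[1]))
    label, nearest = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else nearest
    return label, abs(x - runner_up) - abs(x - nearest)


def self_train(labeled, unlabeled, per_round=2, rounds=10):
    """Iteratively train, label the pool, and move the most confident items over."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        centroids = train_centroids(labeled)            # train on data labeled so far
        scored = [(predict_with_confidence(centroids, x), x) for x in unlabeled]
        scored.sort(key=lambda item: item[0][1], reverse=True)
        for (label, _margin), x in scored[:per_round]:  # most confident predictions
            labeled.append((x, label))                  # augment the labeled data
            unlabeled.remove(x)
    return labeled


if __name__ == "__main__":
    seed = [(0.0, "neg"), (10.0, "pos")]
    pool = [1.0, 9.0, 4.0, 6.0]
    print(self_train(seed, pool))
```

Because each round adds only the `per_round` most confident predictions, points near the decision boundary are labeled last, after the classifier has been retrained on the easier ones.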
Abstract | With a new representation of graph-based SSL on QA datasets using only a handful of features, and under limited amounts of labeled data , we show improvement in generalization performance over state-of-the-art QA models. |
Experiments | In the first part, we randomly selected subsets of labeled training dataset $X_L^i \subset X_L$ with different sample sizes, $n_L^i = \{1\% * n_L, 5\% * n_L, 10\% * n_L, 25\% * n_L, 50\% * n_L, 100\% * n_L\}$, where $n_L$ represents the sample size of $X_L$. At each random selection, the rest of the labeled dataset is hypothetically used as unlabeled data to verify the performance of our SSL using different sizes of labeled data. |
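The subset-sampling protocol in this excerpt (draw a fraction of the labeled set, treat the remainder as unlabeled) can be sketched as below. The function name `sample_labeled_subsets`, the fixed seed, and the rounding rule for subset sizes are assumptions made for illustration; the excerpt does not specify them.

```python
import random


def sample_labeled_subsets(labeled, fractions=(0.01, 0.05, 0.10, 0.25, 0.50, 1.00), seed=0):
    """For each fraction, randomly keep that share of `labeled` as labeled data
    and return the remainder, which would be treated as unlabeled data."""
    rng = random.Random(seed)
    n = len(labeled)
    splits = {}
    for frac in fractions:
        size = max(1, round(frac * n))               # assumed rounding rule
        chosen = set(rng.sample(range(n), size))     # indices kept as labeled
        subset = [labeled[i] for i in range(n) if i in chosen]
        held_out = [labeled[i] for i in range(n) if i not in chosen]
        splits[frac] = (subset, held_out)
    return splits


if __name__ == "__main__":
    data = [(f"sentence-{i}", i % 2) for i in range(200)]
    for frac, (subset, held_out) in sample_labeled_subsets(data).items():
        print(f"{frac:.0%}: {len(subset)} labeled, {len(held_out)} treated as unlabeled")
```

Holding the pool fixed and only moving the labeled/unlabeled boundary keeps the comparison across sample sizes controlled: every run sees the same 200 items, differing only in how many carry labels.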
Experiments | Note from Table 2 that, when the number of labeled data is small ($n_L^i < 10\% * n_L$), graph based SSL, gSum SSL, has a better performance compared to SVM. |
Experiments | Especially in Hybrid graph-Summary SSL, Hybrid gSum SSL, when the number of labeled data is small ($n_L^i < 25\% * n_L$) performance improvement is better than rest |
Graph Summarization | The labeled data points, i.e., $X_L$, are appended to each of these selected $X^s$ datasets, $X^s = \{x_1^s, \ldots, x_{n_b - l}^s\} \cup X_L$. |
Graph Summarization | The local density constraints become crucial for inference where summarized labeled data are used instead of overall dataset. |
Graph Summarization | As a result, $q$ summary datasets $X^s$, each with $n_b$ labeled data points, are combined to form a representative sample of $X$, $\bar{X} = \{X^s\}_{s=1}^{q}$, reducing the number of data from $n$ to a much smaller number of data, $p = q * n_b \ll n$. So the new summary of $X$ can be represented with $\bar{X} = \{X_i\}_{i=1}^{p}$. |
Introduction | One of the challenges we face is that we have a very limited amount of labeled data, i.e., correctly labeled (true/false entailment) sentences. |
Introduction | We consider situations where there are much more unlabeled data, $X_U$, than labeled data, $X_L$, i.e., $n_L \ll n_U$. |
Introduction | — application of a graph-summarization method to enable learning from a very large unlabeled and rather small labeled data , which would not have been feasible for most sophisticated learning tools in section 4. |
Datasets | (2006), we use sections 2-21 from Wall Street Journal (WSJ) as the source domain labeled data. |
Distribution Prediction | Our distribution prediction learning method is unsupervised in the sense that it does not require manually labeled data for a particular task from any of the domains. |
Domain Adaptation | The main reason that a model trained only on the source domain labeled data performs poorly in the target domain is the feature mismatch — few features in target domain test instances appear in source domain training instances. |
Experiments and Results | For each domain, the accuracy obtained by a classifier trained using labeled data from that |
Experiments and Results | This upper baseline represents the classification accuracy we could hope to obtain if we were to have labeled data for the target domain. |
Introduction | Our proposed cross-domain word distribution prediction method is unsupervised in the sense that it does not require any labeled data in either of the two steps. |
O \ | Unlike our distribution prediction method, which is unsupervised, SST requires labeled data for the source domain to learn a feature mapping between a source and a target domain in the form of a thesaurus. |
Related Work | (2006) append the source domain labeled data with predicted pivots (i.e. |
Related Work | The unsupervised DA setting that we consider does not assume the availability of labeled data for the target domain. |
Related Work | However, if a small amount of labeled data is available for the target domain, it can be used to further improve the performance of DA tasks (Xiao et al., 2013; Daume III, 2007). |
Abstract | SSL techniques are often effective in text classification, where labeled data is scarce but large unlabeled corpora are readily available. |
Introduction | Semi-supervised Learning (SSL) is a Machine Learning (ML) approach that utilizes large amounts of unlabeled data, combined with a smaller amount of labeled data , to learn a target function (Zhu, 2006; Chapelle et al., 2006). |
Introduction | Then, for a given target class and labeled data set, we utilize the statistics to improve a classifier. |
Introduction | The marginal statistics are used as a constraint to improve the class-conditional probability estimates $P(w \mid +)$ and $P(w \mid -)$ for the positive and negative classes, which are often noisy when estimated over sparse labeled data sets. |
Problem Definition | In particular, SFE uses the equality $P(+ \mid w) = P(+, w)/P(w)$, and estimates the rhs using $P(w)$ computed over all the unlabeled data, rather than using only labeled data as in standard MNB. |
Problem Definition | Further, it can be shown that as $P(w)$ of a word $w$ in the unlabeled data becomes larger than that in the labeled data, SFE’s estimate of the ratio $P(w \mid +)/P(w \mid -)$ approaches one. |
Problem Definition | Depending on the labeled data , such an estimate can be arbitrarily inaccurate. |
Abstract | With a conditional random field based probabilistic dependency parser, our training objective is to maximize mixed likelihood of labeled data and auto-parsed unlabeled data with ambiguous labelings. |
Ambiguity-aware Ensemble Training | not sufficiently covered in manually labeled data . |
Ambiguity-aware Ensemble Training | Since $D'$ contains many more instances than $D$ (1.7M vs. 40K for English, and 4M vs. 16K for Chinese), it is likely that the unlabeled data may overwhelm the labeled data during SGD training. |
Ambiguity-aware Ensemble Training | Input: Labeled data $D = \{(x_i, d_i)\}_{i=1}^{N}$, and unlabeled data $D' = \{(u_j, v_j)\}_{j=1}^{M}$; Parameters: $I$, $N_1$, $M_1$, $b$. Output: $w$. Initialization: $w^{(0)} = 0$, $k = 0$. for $i = 1$ to $I$ do {iterations}: Randomly select $N_1$ instances from $D$ and $M_1$ instances from $D'$ to compose a new dataset $D_i$, and shuffle it. |
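The per-iteration sampling step in this excerpt, which keeps the much larger auto-parsed set from swamping the labeled set, can be sketched as follows. Only the sampling and shuffling are shown; the parameter update is omitted. Treating the parameter b as a mini-batch size is an assumption, as are the function names.

```python
import random


def compose_iteration_dataset(labeled, auto_parsed, n1, m1, rng):
    """Draw n1 labeled and m1 auto-parsed instances, then shuffle them together."""
    picked = rng.sample(labeled, min(n1, len(labeled)))
    picked += rng.sample(auto_parsed, min(m1, len(auto_parsed)))
    rng.shuffle(picked)
    return picked


def iterate_batches(labeled, auto_parsed, iterations, n1, m1, batch_size, seed=0):
    """Yield mini-batches; a fresh mixed dataset is composed at every iteration.
    `batch_size` plays the role assumed here for the parameter b."""
    rng = random.Random(seed)
    for _ in range(iterations):
        d_i = compose_iteration_dataset(labeled, auto_parsed, n1, m1, rng)
        for start in range(0, len(d_i), batch_size):
            yield d_i[start:start + batch_size]


if __name__ == "__main__":
    D = [("gold", i) for i in range(10)]        # stand-in for labeled trees
    D_prime = [("auto", j) for j in range(100)]  # stand-in for auto-parsed forests
    for batch in iterate_batches(D, D_prime, iterations=2, n1=4, m1=6, batch_size=5):
        print(batch)
```

Fixing the ratio n1 : m1 per iteration, rather than sampling from the union, is what controls the relative influence of the two sources regardless of their raw sizes.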
Conclusions | The training objective is to maximize the mixed likelihood of both the labeled data and the auto-parsed unlabeled data with ambiguous labelings. |
Introduction | Such sentences can provide more discriminative instances for training which may be unavailable in labeled data . |
Introduction | Evaluation on labeled data shows the oracle accuracy of parse forest is much higher than that of 1-best outputs of single parsers (see Table 3). |
Introduction | Finally, using a conditional random field (CRF) based probabilistic parser, we train a better model by maximizing mixed likelihood of labeled data and auto-parsed unlabeled data with ambiguous labelings. |
AL-SMT: Multilingual Setting | When (re-)training the models, two phrase tables are learned for each SMT model: one from the labeled data $\mathbb{L}$ and the other one from pseudo-labeled data $\mathbb{U}^+$ (which we call the main and auxiliary phrase tables respectively). |
Experiments | We subsampled 5,000 sentences as the labeled data $\mathbb{L}$ and 20,000 sentences as $\mathbb{U}$ for the pool of untranslated sentences (while hiding the English part). |
Introduction | However, if we start with only a small amount of initial parallel data for the new target language, then translation quality is very poor and requires a very large injection of human labeled data to be effective. |
Introduction | In self-training each MT system is retrained using human labeled data plus its own noisy translation output on the unlabeled data. |
Introduction | In co-training each MT system is retrained using human labeled data plus noisy translation output from the other MT systems in the ensemble. |
Sentence Selection: Single Language Pair | The more frequent a phrase is in the labeled data, the less important it is, since we have probably observed most of its translations. |
Sentence Selection: Single Language Pair | In the labeled data L, phrases are the ones which are extracted by the SMT models; but what are the candidate phrases in the unlabeled data U? |
Sentence Selection: Single Language Pair | two multinomials, one for labeled data and the other one for unlabeled data. |
Abstract | However, in some cases a small amount of human labeled data is available. |
Available at http://nlp.stanford.edu/software/mimlre.shtml. | Thus, our approach outperforms the state-of-the-art model for relation extraction using much less labeled data than was used by Zhang et al. (2012) to outper- |
Introduction | In this paper, we present the first effective approach, Guided DS (distant supervision), to incorporate labeled data into distant supervision for extracting relations from sentences. |
Introduction | (2012), we generalize the labeled data through feature selection and model this additional information directly in the latent variable approaches. |
Introduction | While prior work employed tens of thousands of human labeled examples (Zhang et al., 2012) and only got a 6.5% increase in F-score over a logistic regression baseline, our approach uses much less labeled data (about 1/8) but achieves much higher improvement on performance over stronger baselines. |
The Challenge | Simply taking the union of the hand-labeled data and the corpus labeled by distant supervision is not effective since hand-labeled data will be swamped by a larger amount of distantly labeled data. |
The Challenge | An effective approach must recognize that the hand-labeled data is more reliable than the automatically labeled data and so must take precedence in cases of conflict. |
The Challenge | Instead we propose to perform feature selection to generalize human labeled data into training guidelines, and integrate them into a latent variable model. |
Training | Upsampling the labeled data did not improve the performance either. |
Abstract | This noisy labeled data causes poor extraction performance. |
Experiments | We compared the following methods: logistic regression with the labeled data cleaned by the proposed method (PROP), logistic regression with the standard DS labeled data (LR), and MultiR proposed in (Hoffmann et al., 2011) as a state-of-the-art multi-instance learning system. For logistic regression, when more than one relation is assigned to a sentence, we simply copied the feature vector and created a training example for each relation. |
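The multi-relation handling mentioned at the end of this excerpt (copy the feature vector once per assigned relation) can be sketched as below. The data layout, with feature dictionaries paired with lists of relation names, and the function name `expand_multilabel` are assumptions made for illustration.

```python
def expand_multilabel(examples):
    """Turn (features, [relation, ...]) pairs into single-label training examples
    by copying the feature vector once for each assigned relation."""
    expanded = []
    for features, relations in examples:
        for relation in relations:
            expanded.append((dict(features), relation))  # independent copy per label
    return expanded


if __name__ == "__main__":
    sentence_level = [
        ({"word_between=born": 1, "dist=3": 1}, ["place_of_birth", "place_lived"]),
        ({"word_between=founded": 1}, ["founder_of"]),
    ]
    for features, relation in expand_multilabel(sentence_level):
        print(relation, features)
```

The copies let a standard single-label logistic regression consume the data unchanged, at the cost of presenting the same features with different labels.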
Introduction | Supervised approaches are limited in scalability because labeled data is expensive to produce. |
Introduction | A particularly attractive approach, called distant supervision (DS), creates labeled data by heuristically aligning entities in text with those in a knowledge base, such as Freebase (Mintz et al., 2009). |
Introduction | However, the DS assumption can fail, which results in noisy labeled data and this causes poor extraction performance. |
Knowledge-based Distant Supervision | DS uses a knowledge base to create labeled data for relation extraction by heuristically matching entity pairs. |
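The heuristic matching described in this excerpt can be sketched as follows. This is a toy illustration with made-up sentences and a made-up one-entry knowledge base; plain substring matching stands in for real entity recognition, and the function name `distant_label` is hypothetical.

```python
def distant_label(sentences, knowledge_base):
    """Label every sentence that mentions an entity pair found in the knowledge
    base with that pair's relation (the distant supervision assumption)."""
    labeled = []
    for text in sentences:
        for (e1, e2), relation in knowledge_base.items():
            if e1 in text and e2 in text:      # naive mention detection
                labeled.append((text, e1, e2, relation))
    return labeled


if __name__ == "__main__":
    kb = {("Michael Jackson", "Gary"): "place_of_birth"}
    corpus = [
        "Michael Jackson was born in Gary.",
        "Michael Jackson moved from Gary.",    # matched too, but does not express the relation
        "Gary is a city in Indiana.",
    ]
    for row in distant_label(corpus, kb):
        print(row)
```

The second sentence is labeled `place_of_birth` even though it does not express that relation, which is the kind of noisy label the surrounding excerpts attribute to the DS assumption failing.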
Related Work | Previous works (Hoffmann et al., 2010; Yao et al., 2010) have pointed out that the DS assumption generates noisy labeled data , but did not directly address the problem. |
Wrong Label Reduction | labeled data generated by DS: LD |
Wrong Label Reduction | For relation extraction, we train a classifier for entity pairs using the resultant labeled data . |
Empirical Evaluation | For every pair, the semi-supervised methods use labeled data from the source domain and unlabeled data from both domains. |
Empirical Evaluation | We compare them with two supervised methods: a supervised model (Base) which is trained on the source domain data only, and another supervised model (In-domain) which is learned on the labeled data from the target domain. |
Introduction | In addition to the labeled data from the source domain, they also exploit small amounts of labeled data and/or unlabeled data from the target domain to estimate a more predictive model for the target domain. |
Introduction | We use generative latent variable models (LVMs) learned on all the available data: unlabeled data for both domains and on the labeled data for the source domain. |
Related Work | Second, their expectation constraints are estimated from labeled data , whereas we are trying to match expectations computed on unlabeled data for two domains. |
Related Work | This approach bears some similarity to the adaptation methods standard for the setting where labelled data is available for both domains (Chelba and Acero, 2004; Daume and Marcu, 2006). |
The Latent Variable Model | Among the versions which do not exploit labeled data from the target domain. |
The Latent Variable Model | The parameters of this model $\theta = (v, w)$ can be estimated by maximizing joint likelihood $L(\theta)$ of labeled data for the source domain $\{x^{(l)}, y^{(l)}\}_{l \in S_L}$ |
The Latent Variable Model | However, given that, first, the amount of unlabeled data $|S_U \cup T_U|$ normally vastly exceeds the amount of labeled data $|S_L|$ and, second, the number of features for each example $|x^{(l)}|$ is usually large, the label $y$ will have only a minor effect on the mapping from the initial features $x$ to the latent representation $z$ (i.e. |
Abstract | Using utterances from five domains, our approach shows up to 4.5% improvement on domain and dialog act performance over a cascaded approach in which each semantic component is learned sequentially and a supervised joint learning model (which requires fully labeled data). |
Data and Approach Overview | Our algorithm assigns domain/dialog-act/slot labels to each topic at each layer in the hierarchy using labeled data (explained in §4). |
Experiments | Here, we not only want to demonstrate the performance of each component of MCM but also their performance under limited amount of labeled data . |
Experiments | When the number of labeled data is small ($n_L^i \leq 25\% * n_L$), our WebPrior-MCM has a better performance on domain and act predictions compared to the two baselines. |
Experiments | Adding labeled data improves the performance of all models; however, supervised models benefit more than MCM models. |
Introduction | The contributions of this paper are as follows: (i) construction of a novel Bayesian framework for semantic parsing of natural language (NL) utterances in a unifying framework in §4, (ii) representation of seed labeled data and information from web queries as informative prior to design a novel utterance understanding model in §3 & §4, (iii) comparison of our results to supervised sequential and joint learning methods on NL utterances in §5. |
Introduction | We conclude that our generative model achieves noticeable improvement compared to discriminative models when labeled data is scarce. |
Abstract | We show that this unsupervised system has better CoRe performance than other learning approaches that do not use manually labeled data. |
Introduction | Until recently, most approaches tried to solve the problem by binary classification, where the probability of a pair of markables being coreferent is estimated from labeled data . |
Introduction | Self-training approaches usually include the use of some manually labeled data . |
Introduction | In contrast, our self-trained system is not trained on any manually labeled data and is therefore a completely unsupervised system. |
Related Work | not with approaches that make some limited use of labeled data . |
Results and Discussion | Thus, this comparison of ACE-2/Ontonotes results is evidence that in a realistic scenario using association information in an unsupervised self-trained system is almost as good as a system trained on manually labeled data . |
System Architecture | Automatically Labeled Data |
Abstract | Finding concepts in natural language utterances is a challenging task, especially given the scarcity of labeled data for learning semantic ambiguity. |
Experiments | First a supervised learning algorithm is used to build a CRF model based on the labeled data . |
Introduction | Thus, each latent semantic class corresponds to one of the semantic tags found in labeled data . |
Markov Topic Regression - MTR | We assume a fixed K topics corresponding to semantic tags of labeled data . |
Markov Topic Regression - MTR | K latent topics to the K semantic tags of our labeled data . |
Markov Topic Regression - MTR | labeled data , 712?, based on the log-linear model in Eq. |
Semi-Supervised Semantic Labeling | (5) is the loss on the labeled data and $\ell_2$ regularization on parameters, $\Lambda^{(n)}$, from the $n$th iteration, same as standard CRF. |
Semi-Supervised Semantic Labeling | The labeled rows $m^l$ of the vocabulary matrix, $m = \{m^l, m^u\}$, contain only {0,1} values, indicating the word’s observed semantic tags in the labeled data. |
Abstract | To overcome the shortage of labeled data for implicit discourse relation recognition, previous works attempted to automatically generate training data by removing explicit discourse connectives from sentences and then built models on these synthetic implicit examples. |
Implementation Details of Multitask Learning Method | nectives and relations in PDTB and generate synthetic labeled data by removing the connectives. |
Implementation Details of Multitask Learning Method | BLLIP North American News Text (Complete) is used as unlabeled data source to generate synthetic labeled data . |
Implementation Details of Multitask Learning Method | In comparison with the synthetic labeled data generated from the explicit relations in PDTB, the synthetic labeled data from BLLIP contains more noise. |
Multitask Learning for Discourse Relation Prediction | Following these two principles, we create the auxiliary tasks by generating automatically labeled data as follows. |
Multitask Learning for Discourse Relation Prediction | Previous work (Marcu and Echihabi, 2002; Sporleder and Lascarides, 2008) adopted a predefined pattern-based approach to generate synthetic labeled data, where each predefined pattern has one discourse relation label. |
Multitask Learning for Discourse Relation Prediction | In contrast, we adopt an automatic approach to generate synthetic labeled data , where each discourse connective between two texts serves as their relation label. |
Background | Recently, “distant supervision” has emerged to be a popular choice for training relation extractors without using manually labeled data (Mintz et al., 2009; Jiang, 2009; Chan and Roth, 2010; Wang et al., 2011; Riedel et al., 2010; Ji et al., 2011; Hoffmann et al., 2011; Surdeanu et al., 2012; Takamatsu et al., 2012; Min et al., 2013). |
Experiments | (1) Manifold Unlabeled: We combined the labeled data and unlabeled set 1 in training. |
Experiments | (2) Manifold Predicted Labels: We combined labeled data and unlabeled set 2 in training. |
Experiments | beled data and the data from unlabeled set 2 was used as labeled data (With Weights). |
Identifying Key Medical Relations | Our current strategy is to integrate all associated types, and rely on the relation detector trained with the labeled data to decide how to weight different types based upon the context. |
Introduction | When we build a naive model to detect relations, the model tends to overfit the labeled data. |
Relation Extraction with Manifold Models | Integration of the unlabeled data can help solve overfitting problems when the labeled data is not sufficient. |
Introduction | In the past years, several proposed supervised joint models (Ng and Low, 2004; Zhang and Clark, 2008; Jiang et al., 2009; Zhang and Clark, 2010) achieved reasonably accurate results, but the outstanding problem among these models is that they rely heavily on a large amount of labeled data , i.e., segmented texts with POS tags. |
Introduction | However, the production of such labeled data is extremely time-consuming and expensive (Jiao et al., 2006; Jiang et al., 2009). |
Introduction | Motivated by the works in (Subramanya et al., 2010; Das and Smith, 2011), for structured problems, graph-based label propagation can be employed to infer valuable syntactic information (n-gram-level label distributions) from labeled data to unlabeled data. |
Method | It is especially helpful for the graph to make connections with trigrams that may not have been seen in labeled data but have similar label information. |
Method | The first term in Equation (5) is the same as Equation (2), which is the traditional CRFs learning objective function on the labeled data. |
Method | To satisfy the characteristic of the semi-supervised learning problem, the train set, i.e., the labeled data , is formed by a relatively small amount of annotated texts sampled from CTB-7. |
Conclusion and Future Work | One obvious direction is to use the whole Penn Treebank as labeled data and use some other unannotated data source as unlabeled data for semi-supervised training. |
Efficient Optimization Strategy | • Step 2: based on the parameter weights learned from the labeled data, update θ and Yj on each unlabeled sentence alternately: |
Introduction | However, a key drawback of supervised training algorithms is their dependence on labeled data , which is usually very difficult to obtain. |
Introduction | This loss function has the advantage that the entire training objective on both the labeled and unlabeled data now becomes convex, since it consists of a convex structured large margin loss on labeled data and a convex least squares loss on unlabeled data. |
Introduction | In particular, we investigate a semi-supervised approach for structured large margin training, where the objective is a combination of two convex functions, the structured large margin loss on labeled data and the least squares loss on unlabeled data. |
Semi-supervised Convex Training for Structured SVM | for structured large margin training, whose objective is a combination of two convex terms: the supervised structured large margin loss on labeled data and the cheap least squares loss on unlabeled data. |
Semi-supervised Convex Training for Structured SVM | By combining the convex structured SVM loss on labeled data (shown in Equation (5)) and the convex least squares loss on unlabeled data (shown in Equation (8)), we obtain a semi-supervised structured large margin loss |
Abstract | Labeled data is not readily available for these tasks, so we focus on the unsupervised setting. |
Experiments | Note that we do not consider this performance to be the upper bound of supervised approaches; clearly, supervised approaches could benefit from additional labeled data. |
Experiments | However, labeled data is relatively expensive to obtain for this task. |
Introduction | There is no existing labeled data for the tasks of interest, and we would like the methods we develop to be easily applied in multiple domains. |
Introduction | Motivated by this, we propose a generative model for solving these tasks jointly without labeled data . |
Models | To identify refinements without labeled data , we propose a generative model of reviews (or more generally documents) with latent variables. |
Abstract | The context-aware constraints provide additional power to the CRF model and can guide semi-supervised learning when labeled data is limited. |
Approach | PR makes the assumption that the labeled data we have is not enough for learning good model parameters, but we have a set of constraints on the posterior distribution of the labels. |
Experiments | For the MD dataset, we also used the dvd domain as additional labeled data for developing the constraints. |
Experiments | We found that the PR model is able to correct many CRF errors caused by the lack of labeled data . |
Experiments | However, with limited labeled data , the CRF learner can only associate very weak sentiment signals to these features. |
Related Work | Compared to the existing work on semi-supervised learning for sentence-level sentiment classification (Täckström and McDonald, 2011a; Täckström and McDonald, 2011b; Qu et al., 2012), our work does not rely on a large amount of coarse-grained (document-level) labeled data; instead, distant supervision mainly comes from linguistically-motivated constraints. |
Experiments | In chunking, there is a clear trend toward larger increases in performance as words become rarer in the labeled data set, from a 0.02 improvement on words of frequency 2, to an improvement of 0.21 on OOV words. |
Experiments | To measure the sample complexity of the supervised CRF, we use the same experimental setup as in the chunking experiment on WSJ text, but we vary the amount of labeled data available to the CRF. |
Experiments | Thus smoothing is optimizing performance for the case where unlabeled data is plentiful and labeled data is scarce, as we would hope. |
Related Work | Several researchers have previously studied methods for using unlabeled data for tagging and chunking, either alone or as a supplement to labeled data . |
Related Work | Our technique lets the HMM find parameters that maximize cross-entropy, and then uses labeled data to learn the best mapping from the HMM categories to the POS categories. |
Related Work | Our technique uses unlabeled training data from the target domain, and is thus applicable more generally, including in web processing, where the domain and vocabulary are highly variable, and it is extremely difficult to obtain labeled data that is representative of the test distribution. |
Bilingual Bootstrapping | Algorithm 1 Bilingual Bootstrapping LD1 := Seed Labeled Data from L1 LD2 := Seed Labeled Data from L2 UD1 := Unlabeled Data from L1 UD2 := Unlabeled Data from L2 |
Bilingual Bootstrapping | These projected models are then applied to the untagged data of L1 and L2 and the instances which get labeled with a high confidence are added to the labeled data of the respective languages. |
Bilingual Bootstrapping | Algorithm 2 Monolingual Bootstrapping LD1 := Seed Labeled Data from L1 LD2 := Seed Labeled Data from L2 UD1 := Unlabeled Data from L1 UD2 := Unlabeled Data from L2 |
Experimental Setup | In each iteration only those words for which P(assigned_sense|word) > 0.6 get moved to the labeled data . |
Experimental Setup | Hence, we used a fixed threshold of 0.6 so that in each iteration only those words get moved to the labeled data for which the assigned sense is clearly a majority sense (P > 0.6). |
Capturing Paradigmatic Relations via Word Clustering | Previous research shows that character-based segmentation models trained on labeled data are reasonably accurate (Sun, 2010). |
Capturing Paradigmatic Relations via Word Clustering | Table 5 summarizes the accuracies of the systems when trained on smaller portions of the labeled data . |
Capturing Paradigmatic Relations via Word Clustering | In other words, the word cluster features can significantly reduce the amount of labeled data required by the learning algorithm. |
State-of-the-Art | In this paper, we use CTB 6.0 as the labeled data for the study. |
Conclusion | We employ stacking models to incorporate features derived from heterogeneous analysis and apply them to convert heterogeneous labeled data for retraining. |
Data-driven Annotation Conversion | It is possible to acquire high quality labeled data for a specific annotation standard by exploring existing heterogeneous corpora, since the annotations are normally highly compatible. |
Data-driven Annotation Conversion | Moreover, the exploitation of additional (pseudo) labeled data aims to reduce the estimation error and enhances a NLP system in a different way from stacking. |
Data-driven Annotation Conversion | _ CTa / _ gppd—wtb qu1re a new labeled data set D Ctb — Dppdflctb |
Introduction | This is implemented by maximizing the empirical accuracy on the prior knowledge ( labeled data ) and the entropy of hash functions (estimated over labeled and unlabeled data). |
Semi-Supervised SimHash | Let $X_L = \{(x_1, c_1), \ldots, (x_u, c_u)\}$ be the labeled data, $c \in \{1, \ldots, C\}$, $x \in \mathbb{R}^M$, and $X_U = \{x_{u+1}, \ldots, x_N\}$ the unlabeled data. |
Semi-Supervised SimHash | Given the labeled data $X_L$, we construct two sets, attraction set $\Theta_a$ and repulsion set $\Theta_r$. |
Semi-Supervised SimHash | Furthermore, we also hope to maximize the empirical accuracy on the labeled data $\Theta_a$ and $\Theta_r$ and |
The direction is determined by concatenating w L times. | This is implemented by maximizing the empirical accuracy on labeled data together with the entropy of hash functions. |
Abstract | In our experiments on the Boston University radio news corpus, using only a small amount of the labeled data as the initial training set, our proposed labeling method combined with most confidence sample selection can effectively use unlabeled data to improve performance and finally reach performance closer to that of the supervised method using all the training data. |
Co-training strategy for prosodic event detection | Given a set L of labeled data and a set U of unlabeled data, the algorithm first creates a smaller pool U’ containing u unlabeled data. |
Conclusions | In our experiment, we used some labeled data as development set to estimate some parameters. |
Experiments and results | Among labeled data, 102 utterances of all f1a and m1b speakers are used for testing, 20 utterances randomly chosen from f2b, f3b, m2b, m3b, and m4b are used as development set to optimize parameters such as A and confidence level threshold, 5 utterances are used as the initial training set L, and the rest of the data is used as unlabeled set U, which has 1027 unlabeled utterances (we removed the human labels for co-training experiments). |
Experiments and results | We can see that the performance of co-training for these three tasks is slightly worse than supervised learning using all the labeled data, but is significantly better than the original performance using 3% of hand labeled data . |
Evaluation | All ranking models above were trained only on source domain training data and the labeled data of target domain was just used for testing. |
Instance Weighting Scheme Review | Jiang and Zhai (2007) used a small amount of labeled data from the target domain to weight source instances. |
Introduction | To alleviate the lack of training data in the target domain, many researchers have proposed to transfer ranking knowledge from the source domain with plenty of labeled data to the target domain where little or no labeled data is available, which is known as ranking model adaptation (Chen et al., 2008a; Chen et al., 2010; Chen et al., 2008b; Geng et al., 2009; Gao et al., 2009). |
Related Work | In (Geng et al., 2009; Chen et al., 2008b), the parameters of the ranking model trained on the source domain were adjusted with the small set of labeled data in the target domain. |
Related Work | al., 2008a) weighted source instances by using a small amount of labeled data in the target domain. |
Discussion and Related Work | Ando and Zhang (2005) defined an objective function that combines the original problem on the labeled data with a set of auxiliary problems on unlabeled data. |
Discussion and Related Work | One is to leverage a large amount of unsupervised data to train an adequate classifier with a small amount of labeled data . |
Introduction | While the labeled data is generally very costly to obtain, there is a vast amount of unlabeled textual data freely available on the web. |
Introduction | Under this approach, even if a word is not found in the training data, it may still fire cluster-based features as long as it shares cluster assignments with some words in the labeled data . |
Introduction | Since the clusters are obtained without any labeled data , they may not correspond directly to concepts that are useful for decision making in the problem domain. |
Experiments | Due to these reasons, there is a lack of sufficient and high quality labeled data for emotion research. |
Experiments | Since in real world applications people are primarily concerned with how well the algorithm will work for new TV shows or movies that may not be included in the training data, we defined a test fold for each TV show or movie in our labeled data set. |
Experiments | Each test fold corresponded to a training fold containing all the labeled data from all the other TV shows and movies. |
Introduction | An active learner uses a small set of labeled data to iteratively select the most informative instances from a large pool of unlabeled data for human annotators to label (Settles, 2010). |
Related Work | In Active Learning (Settles, 2010) a small set of labeled data is used to find documents that should be annotated from a large pool of unlabeled documents. |
Experiments of Parsing | Recent studies on parsing indicate that the use of unlabeled data by self-training can help parsing on the WSJ data, even when labeled data is relatively large (McClosky et al., 2006a; Reichart and Rappoport, 2007). |
Experiments of Parsing | Table 7 shows the performance of self-trained generative parser and updated reranker on the test set, with CTB and CDTfs as labeled data . |
Experiments of Parsing | All the works in Table 8 used CTB articles 1-270 as labeled data . |
Introduction | It is important to acquire additional labeled data for the target grammar parsing through exploitation of existing source treebanks since there is often a shortage of labeled data . |
Introduction | When coupled with self-training technique, a reranking parser with CTB and converted CDT as labeled data achieves 85.2% f-score on CTB test set, an absolute 1.0% improvement (6% error reduction) over the previous best result for Chinese parsing. |
Abstract | ConceptResolver performs both word sense induction and synonym resolution on relations extracted from text using an ontology and a small amount of labeled data . |
ConceptResolver | A is re-selected on each iteration because the initial labeled data set is extremely small, so the initial validation set is not necessarily representative of the actual data. |
ConceptResolver | 1. Initialize labeled data L with 10 positive and 50 negative examples (pairs of senses) |
Prior Work | These approaches use large amounts of labeled data , which can be difficult to create. |
Experiments | The data set consists of labelled data for both the source (Wall Street Journal portion of the Penn Treebank) and target (web) domains. |
Experiments | Participants are not allowed to use web-domain labelled data for training. |
Experiments | In addition to labelled data , a large amount of unlabelled data on the web domain is also provided. |
Introduction | The problem we face here can be considered as a special case of domain adaptation, where we have access to labelled data on the source domain (PTB) and unlabelled data on the target domain (web data). |
A multitask transfer learning solution | We now present a multitask transfer learning solution to the weakly-supervised relation extraction problem, which makes use of the labeled data from the auxiliary relation types. |
A multitask transfer learning solution | It is general for any transfer learning problem with auxiliary labeled data from similar tasks. |
Introduction | However, supervised learning heavily relies on a sufficient amount of labeled data for training, which is not always available in practice due to the labor-intensive nature of human annotation. |
Introduction | Inspired by recent work on transfer learning and domain adaptation, in this paper, we study how we can leverage labeled data of some old relation types to help the extraction of a new relation type in a weakly-supervised setting, where only a few seed instances of the new relation type are available. |
Abstract | The proposed approach trains a character-based and word-based model on labeled data , respectively, as the initial models. |
Introduction | The proposed approach begins by training a character-based and word-based model on labeled data respectively, and then both models are regularized from each view by their segmentation agreements, i.e., the identical outputs, of unlabeled data. |
Semi-supervised Learning via Co-regularizing Both Models | This study proposes a co-regularized CWS model based on character-based and word-based models, built on a small amount of segmented sentences ( labeled data ) and a large amount of raw sentences (unlabeled data). |
Semi-supervised Learning via Co-regularizing Both Models | The model induction process is described in Algorithm 1: given labeled dataset Dl and unlabeled dataset Du, the first two steps are training a CRFs (character-based) and Perceptrons (word-based) model on the labeled data Dl, respectively. |
Generalized Expectation Criteria | In general, the objective function could also include the likelihood of available labeled data, but throughout this paper we assume we have no parsed sentences. |
Generalized Expectation Criteria | If there are constraint functions G for all model feature functions F3, and the target expectations G are estimated from labeled data, then the globally optimal parameter setting under the GE objective function is equivalent to the maximum likelihood solution. |
Linguistic Prior Knowledge | For some experiments that follow we use “oracle” constraints that are estimated from labeled data . |
Related Work | (2006) both use modified forms of self-training to bootstrap parsers from limited labeled data . |
Cross-Language Structural Correspondence Learning | The confidence, however, can only be determined for $w_S$ since the setting gives us access to labeled data from $S$ only. |
Related Work | In the basic domain adaptation setting we are given labeled data from the source domain and unlabeled data from the target domain, and the goal is to train a classifier for the target domain. |
Related Work | Beyond this setting one can further distinguish whether a small amount of labeled data from the target domain is available (Daumé, 2007; Finkel and Manning, 2009) or not (Blitzer et al., 2006; Jiang and Zhai, 2007). |
Related Work | (2007) apply structural learning to image classification in settings where little labeled data is given. |
Abstract | Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data , to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). |
Previous Work | Weakly supervised approaches remove some of the need for fully labeled data . |
Previous Work | Shinyama and Sekine (2006) describe an approach to template learning without labeled data . |
Standard Evaluation | Our precision is as good as (and our F1 score near) two algorithms that require knowledge of the templates and/or labeled data. |
Abstract | Usually the extraction performance depends heavily on the quality and quantity of the labeled data; however, the manual annotation of a large-scale corpus is labor-intensive and time-consuming. |
Abstract | During iterations a batch of unlabeled instances are chosen in terms of their informativeness to the current classifier, labeled by an oracle and in turn added into the labeled data to retrain the classifier. |
Abstract | Input: - L, labeled data set - U, unlabeled data set - n, batch size Output: - SVM, classifier Repeat: 1. |
Abstract | Our model achieves high accuracy, without any explicitly labeled data except the user provided opinion ratings. |
Introduction | When labeled data exists, this problem can be solved effectively using a wide variety of methods available for text classification and information extraction (Manning and Schutze, 1999). |
Introduction | However, labeled data is often hard to come by, especially when one considers all possible domains of products and services. |
Experiments | • Self-training Segmenters (STS): two variant models were defined by the approach reported in (Subramanya et al., 2010) that uses the supervised CRFs model’s decodings, incorporating empirical and constraint information, for unlabeled examples as additional labeled data to retrain a CRFs model. |
Introduction | They leverage such mappings to either constitute a Chinese word dictionary for maximum-matching segmentation (Xu et al., 2004), or form labeled data for training a sequence labeling model (Paul et al., 2011). |
Methodology | Our learning problem belongs to semi-supervised learning (SSL), as the training is done on treebank labeled data $(X_L, Y_L) = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ and bilingual unlabeled data $(X_U) = \{x_1, \ldots, x_u\}$, where $x_i = \{x_1, \ldots, x_m\}$ is an input word sequence and $y_i = \{y_1, \ldots, y_m\}$, $y \in T$, is its corresponding label sequence. |
Experiment | We use the benchmark datasets provided by the second International Chinese Word Segmentation Bakeoff2 as the labeled data . |
Our method | We randomly reuse some characters labeled ’N’ from labeled data until ratio η is reached. |
Our method | In summary our algorithm tackles the problem by duplicating labeled data in source domain. |
Experiments | ting an antagonistic adversary corrupt our labeled data - somewhat surprisingly, maybe - leads to better cross-domain performance. |
Introduction | The problem with out-of-vocabulary effects can be illustrated using a small labeled data set: {X1 = <1,<0,1,0>>, X2 = <1,<0,1,1>>, X3 = <0,<0,0,0>>, X4 = <1,<0,0, Say we train our model on X1-3 and evaluate it on the fourth data point. |
Robust perceptron learning | Globerson and Roweis (2006) let an adversary corrupt labeled data during training to learn better models of test data with missing features. |
Distant Supervision | To summarize, the results of our experiments using distant supervision show that a sentiment relevance classifier can be trained successfully by labeling data with a few simple feature rules, with |
Methods | Supervised optimization is impossible as we do not have any labeled data . |
Related Work | In general, it is not possible to know what the underlying concepts of a statistical classification are if no detailed annotation guidelines exist and no direct evaluation of manually labeled data is performed. |
Clustering for Cross Lingual Sentiment Analysis | Algorithm 1 Projection based on sense Input: Polarity labeled data in source language (S) and data in target language (T) to be labeled Output: Classified documents 1: Sense mark the polarity labeled data from S 2: Project the sense marked corpora from S to T using a Multidict 3: Model the sentiment classifier using the data obtained in step-2 4: Sense mark the unlabelled data from T 5: Test the sentiment classifier on data obtained in step-4 using model obtained in step-3 |
Introduction | Popular approaches for Cross-Lingual Sentiment Analysis (CLSA) (Wan, 2009; Duh et al., 2011) depend on Machine Translation (MT) for converting the labeled data from one language to the other (Hiroshi et al., 2004; Banea et al., 2008; Wan, 2009). |
Related Work | In situations where labeled data is not present in a language, approaches based on cross-lingual sentiment analysis are used. |
Abstract | The clear drawback of supervised methods is the need of training data: labeled data is expensive to obtain, and there is often a mismatch between the training data and the data the system will be applied to. |
Introduction | However, the clear drawback of supervised methods is the need of training data, which can slow down the delivery of commercial applications in new domains: labeled data is expensive to obtain, and there is often a mismatch between the training data and the data the system will be applied to. |
Related Work | These correspondences are then integrated as new features in the labeled data of the source domain. |
Experimental Settings | The dataset was collected for the purpose of constructing semantic parsers from ambiguous supervision and consists of both “noisy” and gold labeled data . |
Experimental Settings | The gold labeled data consists of pairs (X, y). |
Semantic Interpretation Model | Moreover, since learning this layer is a byproduct of the learning process (as it does not use any labeled data ) forcing the connection between the decisions is the mechanism that drives learning this model. |
Experiments | Since ME-LDA used manually labeled training data for Max-Ent, we again randomly sampled 1000 terms from our corpus appearing at least 20 times and labeled them as aspect terms or sentiment terms, so this labeled data clearly has less noise than our automatically labeled data . |
Proposed Seeded Models | Note that unlike traditional Max-Ent training, we do not need manually labeled data for training (see Section 4 for details). |
Related Work | We adopt this method as well but with no use of manually labeled data in training. |