Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Titov, Ivan

Article Structure

Abstract

We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain.

Introduction

Supervised learning methods have become a standard tool in natural language processing, and large training sets have been annotated for a wide variety of tasks.

The Latent Variable Model

The adaptation method advocated in this paper is applicable to any joint probabilistic model which uses distributed representations, i.e.

Constraints on Inter-Domain Variability

As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations.

Learning and Inference

In this section we describe an approximate learning algorithm based on the mean-field approximation.

Empirical Evaluation

In this section we empirically evaluate our approach on the sentiment classification task.

Related Work

There is a growing body of work on domain adaptation.

Discussion and Conclusions

In this paper we presented a domain-adaptation method based on semi-supervised learning with distributed representations coupled with constraints favoring domain-independence of modeled phenomena.

Topics

latent variables

Appears in 23 sentences as: latent variable (10) latent variables (14)
In Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
  1. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains.
    Page 1, “Abstract”
  2. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain.
    Page 1, “Abstract”
  3. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains.
    Page 1, “Abstract”
  4. We use generative latent variable models (LVMs) learned on all the available data: unlabeled data for both domains and on the labeled data for the source domain.
    Page 1, “Introduction”
  5. The latent variables encode regularities observed on unlabeled data from both domains, and they are learned to be predictive of the labels on the source domain.
    Page 2, “Introduction”
  6. The danger of this semi-supervised approach in the domain-adaptation setting is that some of the latent variables will correspond to clusters of features specific only to the source domain, and consequently, the classifier relying on this latent variable will be badly affected when tested on the target domain.
    Page 2, “Introduction”
  7. We encode this intuition by introducing a term in the learning objective which regularizes inter-domain difference in marginal distributions of each latent variable.
    Page 2, “Introduction”
  8. In our experiments, we use a form of Harmonium Model (Smolensky, 1986) with a single layer of binary latent variables.
    Page 2, “Introduction”
  9. In Section 2 we introduce a model which uses vectors of latent variables to model statistical dependencies between the elementary features.
    Page 2, “Introduction”
  10. vectors of latent variables, to abstract away from handcrafted features.
    Page 2, “The Latent Variable Model”
  11. The model assumes that the features and the latent variable vector are generated jointly from a globally-normalized model and then the label y is generated from a conditional distribution dependent on z.
    Page 3, “The Latent Variable Model”
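The quoted sentences describe a harmonium-type model (Smolensky, 1986) with a single layer of binary latent variables and no lateral connections between them, so the posterior over the latent vector factorizes. A minimal sketch of computing that factorized posterior, with hypothetical parameter names `v` and `b` (the paper's exact parameterization is not reproduced here):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def latent_posterior(x, v, b):
    """Factorized posterior q(z_i = 1 | x) for a harmonium-style model
    with a single layer of binary latent variables.

    x : (n_features,) feature vector
    v : (n_latent, n_features) feature-to-latent interaction weights
    b : (n_latent,) latent biases
    """
    return sigmoid(v @ x + b)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=6).astype(float)   # toy binary feature vector
v = rng.normal(scale=0.1, size=(4, 6))
b = np.zeros(4)
q = latent_posterior(x, v, b)                  # one probability per latent unit
```

Each entry of `q` is the probability that one latent unit is active, roughly the quantity a mean-field learning procedure would operate on.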

semi-supervised

Appears in 13 sentences as: Semi-supervised (1) semi-supervised (12)
  1. We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain.
    Page 1, “Abstract”
  2. The danger of this semi-supervised approach in the domain-adaptation setting is that some of the latent variables will correspond to clusters of features specific only to the source domain, and consequently, the classifier relying on this latent variable will be badly affected when tested on the target domain.
    Page 2, “Introduction”
  3. As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations.
    Page 3, “Constraints on Inter-Domain Variability”
  4. In this section, we first discuss the shortcomings of domain adaptation with the above-described semi-supervised approach and motivate constraints on inter-domain variability of
    Page 3, “Constraints on Inter-Domain Variability”
  5. For every pair, the semi-supervised methods use labeled data from the source domain and unlabeled data from both domains.
    Page 7, “Empirical Evaluation”
  6. All the methods, supervised and semi-supervised, are based on the model described in Section 2.
    Page 7, “Empirical Evaluation”
  7. This does not seem to have an adverse effect on the accuracy but makes learning very efficient: the average training time for the semi-supervised methods was about 20 minutes on a standard PC.
    Page 7, “Empirical Evaluation”
  8. In our case, due to joint learning and non-convexity of the learning problem, this approach would be problematic.4 Instead, we combine predictions of the semi-supervised models Reg and NoReg with the baseline out-of-domain model (Base) using the product-of-experts combination (Hinton, 2002); the corresponding methods are called Reg+ and NoReg+, respectively.
    Page 7, “Empirical Evaluation”
  9. (2007) are slightly worse than those demonstrated in our experiments both for supervised and semi-supervised methods.
    Page 8, “Empirical Evaluation”
  10. Various semi-supervised techniques for domain-adaptation have also been considered, one example being self-training (McClosky et al., 2006).
    Page 9, “Related Work”
  11. Semi-supervised learning with distributed representations and its application to domain adaptation has previously been considered in (Huang and Yates, 2009), but no attempt has been made to address problems specific to the domain-adaptation setting.
    Page 9, “Related Work”
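Item 8 above mentions combining the semi-supervised models with the out-of-domain baseline via a product of experts (Hinton, 2002). A minimal sketch of that combination rule for class posteriors; the probability values below are invented for illustration:

```python
import numpy as np

def product_of_experts(p1, p2):
    """Combine two class-posterior vectors by a product of experts
    (Hinton, 2002): multiply them element-wise and renormalize."""
    prod = np.asarray(p1, dtype=float) * np.asarray(p2, dtype=float)
    return prod / prod.sum()

# Hypothetical posteriors from a semi-supervised model (Reg) and the
# out-of-domain baseline (Base); the numbers are invented.
p_reg = [0.7, 0.3]
p_base = [0.6, 0.4]
p_combined = product_of_experts(p_reg, p_base)   # ≈ [0.78, 0.22]
```

The product sharpens agreement between the two experts: each model can veto a class by assigning it low probability.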

unlabeled data

Appears in 12 sentences as: unlabeled data (12)
  1. We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain.
    Page 1, “Abstract”
  2. In addition to the labeled data from the source domain, they also exploit small amounts of labeled data and/or unlabeled data from the target domain to estimate a more predictive model for the target domain.
    Page 1, “Introduction”
  3. In this paper we focus on a more challenging and arguably more realistic version of the domain-adaptation problem where only unlabeled data is available for the target domain.
    Page 1, “Introduction”
  4. (2006) use auxiliary tasks based on unlabeled data for both domains (called pivot features) and a dimensionality reduction technique to induce such shared representation.
    Page 1, “Introduction”
  5. We use generative latent variable models (LVMs) learned on all the available data: unlabeled data for both domains and on the labeled data for the source domain.
    Page 1, “Introduction”
  6. The latent variables encode regularities observed on unlabeled data from both domains, and they are learned to be predictive of the labels on the source domain.
    Page 2, “Introduction”
  7. and unlabeled data for the source and target domain {x(l)}, l ∈ SU ∪ TU, where SU and TU stand for the unlabeled datasets for the source and target domains, respectively.
    Page 3, “The Latent Variable Model”
  8. However, given that, first, the amount of unlabeled data |SU ∪ TU| normally vastly exceeds the amount of labeled data |SL| and, second, the number of features for each example |x(l)| is usually large, the label y will have only a minor effect on the mapping from the initial features x to the latent representation z (i.e.
    Page 3, “The Latent Variable Model”
  9. Intuitively, maximizing the likelihood of unlabeled data is closely related to minimizing the reconstruction error, that is, training a model to discover such mapping parameters v that z encodes all the necessary information to accurately reproduce x(l) from z for every training example x(l).
    Page 6, “Learning and Inference”
  10. For every pair, the semi-supervised methods use labeled data from the source domain and unlabeled data from both domains.
    Page 7, “Empirical Evaluation”
  11. Also, it is important to point out that the SCL method uses auxiliary tasks to induce the shared feature representation; these tasks are constructed on the basis of unlabeled data.
    Page 8, “Empirical Evaluation”

labeled data

Appears in 9 sentences as: labeled data (9) labelled data (1)
  1. In addition to the labeled data from the source domain, they also exploit small amounts of labeled data and/or unlabeled data from the target domain to estimate a more predictive model for the target domain.
    Page 1, “Introduction”
  2. We use generative latent variable models (LVMs) learned on all the available data: unlabeled data for both domains and on the labeled data for the source domain.
    Page 1, “Introduction”
  3. 1Among the versions which do not exploit labeled data from the target domain.
    Page 2, “The Latent Variable Model”
  4. The parameters of this model θ = (v, w) can be estimated by maximizing the joint likelihood L(θ) of labeled data for the source domain {x(l), y(l)}, l ∈ SL.
    Page 3, “The Latent Variable Model”
  5. However, given that, first, the amount of unlabeled data |SU ∪ TU| normally vastly exceeds the amount of labeled data |SL| and, second, the number of features for each example |x(l)| is usually large, the label y will have only a minor effect on the mapping from the initial features x to the latent representation z (i.e.
    Page 3, “The Latent Variable Model”
  6. For every pair, the semi-supervised methods use labeled data from the source domain and unlabeled data from both domains.
    Page 7, “Empirical Evaluation”
  7. We compare them with two supervised methods: a supervised model (Base) which is trained on the source domain data only, and another supervised model (In-domain) which is learned on the labeled data from the target domain.
    Page 7, “Empirical Evaluation”
  8. Second, their expectation constraints are estimated from labeled data, whereas we are trying to match expectations computed on unlabeled data for two domains.
    Page 9, “Related Work”
  9. This approach bears some similarity to the adaptation methods standard for the setting where labelled data is available for both domains (Chelba and Acero, 2004; Daume and Marcu, 2006).
    Page 9, “Related Work”

domain adaptation

Appears in 8 sentences as: domain adaptation (8)
  1. We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain.
    Page 1, “Abstract”
  2. One of the most promising research directions on domain adaptation for this setting is based on the idea of inducing a shared feature representation (Blitzer et al., 2006), that is, mapping from the initial feature representation to a new representation such that (1) examples from both domains ‘look similar’ and (2) an accurate classifier can be trained in this new representation.
    Page 1, “Introduction”
  3. As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations.
    Page 3, “Constraints on Inter-Domain Variability”
  4. In this section, we first discuss the shortcomings of domain adaptation with the above-described semi-supervised approach and motivate constraints on inter-domain variability of
    Page 3, “Constraints on Inter-Domain Variability”
  5. Another motivation for the form of regularization we propose originates from theoretical analysis of the domain adaptation problems (Ben-David et al., 2010; Mansour et al., 2009; Blitzer et al., 2007).
    Page 4, “Constraints on Inter-Domain Variability”
  6. There is a growing body of work on domain adaptation.
    Page 8, “Related Work”
  7. Such methods tackle domain adaptation by instance re-weighting (Bickel et al., 2007; Jiang and Zhai, 2007), or, similarly, by feature re-weighting (Satpal and Sarawagi, 2007).
    Page 8, “Related Work”
  8. Semi-supervised learning with distributed representations and its application to domain adaptation has previously been considered in (Huang and Yates, 2009), but no attempt has been made to address problems specific to the domain-adaptation setting.
    Page 9, “Related Work”
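Items 3–5 above motivate a regularizer on inter-domain variability of the latent representation. The sketch below illustrates one plausible form of such a penalty, comparing each latent unit's marginal activation across domains; the squared-difference form is an assumption, not necessarily the paper's exact G(θ):

```python
import numpy as np

def interdomain_penalty(q_src, q_tgt):
    """One plausible penalty on inter-domain variability: compare each
    latent unit's marginal activation on source vs. target unlabeled
    data.  q_src and q_tgt are (n_examples, n_latent) arrays of
    posteriors q(z_i = 1 | x).  The squared-difference form is an
    assumption, not necessarily the paper's exact G(theta)."""
    m_src = q_src.mean(axis=0)   # marginal activation on the source domain
    m_tgt = q_tgt.mean(axis=0)   # marginal activation on the target domain
    return float(np.sum((m_src - m_tgt) ** 2))

q_src = np.array([[0.9, 0.1],
                  [0.7, 0.3]])
q_tgt = np.array([[0.2, 0.4],
                  [0.4, 0.6]])
penalty = interdomain_penalty(q_src, q_tgt)   # (0.8-0.3)^2 + (0.2-0.5)^2 = 0.34
```

A latent unit that fires mostly on one domain inflates the penalty, discouraging clusters of features specific to only the source domain.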

classification task

Appears in 7 sentences as: classification task (6) classification tasks (1)
  1. We show that this constraint is effective on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
    Page 1, “Abstract”
  2. In this paper we consider classification tasks, namely prediction of sentiment polarity of a user review (Pang et al., 2002), and model the joint distribution of the binary sentiment label y ∈ {0, 1} and the multiset of text features x, xi ∈ X.
    Page 3, “The Latent Variable Model”
  3. Consequently, the latent representation induced in this way is likely to be inappropriate for the classification task in question.
    Page 3, “The Latent Variable Model”
  4. At least some of these clusters, when induced by maximizing the likelihood L(θ, α) with sufficiently large α, will be useful for the classification task on the source domain.
    Page 4, “Constraints on Inter-Domain Variability”
  5. In this section we empirically evaluate our approach on the sentiment classification task.
    Page 6, “Empirical Evaluation”
  6. On the sentiment classification task, in order to construct them, two steps need to be performed: (1) a set of words correlated with the sentiment label is selected, and then (2) prediction of each such word is regarded as a distinct auxiliary problem.
    Page 8, “Empirical Evaluation”
  7. Our approach results in competitive domain-adaptation performance on the sentiment classification task, rivalling that of the state-of-the-art SCL method (Blitzer et al., 2007).
    Page 9, “Discussion and Conclusions”
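Item 6 above describes the two-step construction of SCL auxiliary tasks. Step (1) can be sketched as selecting pivot words by correlation with the label; the absolute-covariance criterion used here is a simplification of the frequency/mutual-information criteria of Blitzer et al., and the toy data is invented:

```python
import numpy as np

def select_pivot_words(X, y, k):
    """Step (1): pick the k words whose presence correlates most strongly
    with the sentiment label.  X is an (n_docs, n_words) binary term
    matrix and y an (n_docs,) 0/1 label vector.  The criterion below is
    a simplification of the one used by Blitzer et al."""
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    score = np.abs(X_c.T @ y_c)          # unnormalized covariance per word
    return np.argsort(score)[::-1][:k]   # top-k word indices

# Toy corpus: word 0 co-occurs with positive labels, word 1 with negative,
# word 2 is uninformative.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)
pivots = select_pivot_words(X, y, k=2)   # selects words 0 and 1
```

Step (2) would then treat predicting each selected word from the remaining features as a distinct auxiliary problem.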

distributed representations

Appears in 5 sentences as: distributed representation (1) distributed representations (4)
  1. Such LVMs can be regarded as composed of two parts: a mapping from initial (normally, word-based) representation to a new shared distributed representation, and also a classifier in this representation.
    Page 2, “Introduction”
  2. The adaptation method advocated in this paper is applicable to any joint probabilistic model which uses distributed representations, i.e.
    Page 2, “The Latent Variable Model”
  3. As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations.
    Page 3, “Constraints on Inter-Domain Variability”
  4. Semi-supervised learning with distributed representations and its application to domain adaptation has previously been considered in (Huang and Yates, 2009), but no attempt has been made to address problems specific to the domain-adaptation setting.
    Page 9, “Related Work”
  5. In this paper we presented a domain-adaptation method based on semi-supervised learning with distributed representations coupled with constraints favoring domain-independence of modeled phenomena.
    Page 9, “Discussion and Conclusions”

sentiment classification

Appears in 5 sentences as: sentiment classification (4) sentiment classifiers (1)
  1. We show that this constraint is effective on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
    Page 1, “Abstract”
  2. We evaluate our approach on adapting sentiment classifiers on 4 domains: books, DVDs, electronics and kitchen appliances (Blitzer et al., 2007).
    Page 2, “Introduction”
  3. In this section we empirically evaluate our approach on the sentiment classification task.
    Page 6, “Empirical Evaluation”
  4. On the sentiment classification task, in order to construct them, two steps need to be performed: (1) a set of words correlated with the sentiment label is selected, and then (2) prediction of each such word is regarded as a distinct auxiliary problem.
    Page 8, “Empirical Evaluation”
  5. Our approach results in competitive domain-adaptation performance on the sentiment classification task, rivalling that of the state-of-the-art SCL method (Blitzer et al., 2007).
    Page 9, “Discussion and Conclusions”

In-domain

Appears in 4 sentences as: In-domain (3) in-domain (1)
  1. We compare them with two supervised methods: a supervised model (Base) which is trained on the source domain data only, and another supervised model (In-domain) which is learned on the labeled data from the target domain.
    Page 7, “Empirical Evaluation”
  2. The Base model can be regarded as a natural baseline model, whereas the In-domain model is essentially an upper-bound for any domain-adaptation method.
    Page 7, “Empirical Evaluation”
  3. First, observe that the total drop in the accuracy when moving to the target domain is 8.9%: from 84.6% demonstrated by the In-domain classifier to 75.6% shown by the non-adapted Base classifier.
    Page 8, “Empirical Evaluation”
  4. 5 The drop in accuracy for the SCL method in Table 1 is computed with respect to the less accurate supervised in-domain classifier considered in Blitzer et al.
    Page 8, “Related Work”

learning algorithm

Appears in 4 sentences as: learning algorithm (3) learning algorithms (1)
  1. However, most learning algorithms operate under the assumption that the learning data originates from the same distribution as the test data, though in practice this assumption is often violated.
    Page 1, “Introduction”
  2. We explain how the introduced regularizer can be integrated into the stochastic gradient descent learning algorithm for our model.
    Page 2, “Introduction”
  3. In this section we describe an approximate learning algorithm based on the mean-field approximation.
    Page 5, “Learning and Inference”
  4. Though we believe that our approach is independent of the specific learning algorithm, we provide the description for completeness.
    Page 5, “Learning and Inference”

objective function

Appears in 3 sentences as: objective function (2) objective function: (1)
  1. We augment the multi-conditional log-likelihood L(θ, α) with the weighted regularization term G(θ) to get the composite objective function:
    Page 5, “Constraints on Inter-Domain Variability”
  2. The stochastic gradient descent algorithm iterates over examples and updates the weight vector based on the contribution of every considered example to the objective function LR(θ, α, β).
    Page 5, “Learning and Inference”
  3. The initial learning rate and the weight decay (the inverse squared variance of the Gaussian prior) were set to 0.01, and both parameters were reduced by a factor of 2 every iteration the objective function estimate went down.
    Page 7, “Empirical Evaluation”
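Items 2–3 above describe stochastic-gradient learning with a halving schedule for the learning rate and weight decay. A sketch under assumptions: `grad_fn` and `obj_fn` are hypothetical callbacks for the composite objective, and folding the weight decay into the ascent step is an implementation choice, not the paper's exact procedure:

```python
def sgd_with_halving(grad_fn, obj_fn, w0, n_iters, lr=0.01, decay=0.01):
    """Gradient-ascent sketch of the quoted schedule: learning rate and
    weight decay both start at 0.01 and are halved after any iteration
    in which the objective estimate goes down.  grad_fn and obj_fn are
    hypothetical callbacks returning the gradient of, and the value of,
    the composite objective L_R(theta, alpha, beta)."""
    w, prev_obj = list(w0), float("-inf")
    for _ in range(n_iters):
        g = grad_fn(w)
        # ascent step with Gaussian-prior weight decay folded in
        w = [wi + lr * (gi - decay * wi) for wi, gi in zip(w, g)]
        obj = obj_fn(w)
        if obj < prev_obj:               # objective estimate went down
            lr, decay = lr / 2.0, decay / 2.0
        prev_obj = obj
    return w

# Toy concave objective with its maximum near w = 3
w_final = sgd_with_halving(lambda w: [-2.0 * (w[0] - 3.0)],
                           lambda w: -(w[0] - 3.0) ** 2,
                           [0.0], n_iters=500)
```

On this toy problem the iterate climbs monotonically toward the maximum, so the halving branch never fires; on a noisy stochastic estimate it would progressively shrink the step size.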
