When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
Andreevskaia, Alina and Bergler, Sabine

Article Structure

Abstract

This study presents a novel approach to the problem of system portability across different domains: a sentiment annotation system that integrates a corpus-based classifier trained on a small set of annotated in-domain data and a lexicon-based system trained on WordNet.

Introduction

One of the emerging directions in NLP is the development of machine learning methods that perform well not only on the domain on which they were trained, but also on other domains, for which training data is not available or is not sufficient to ensure adequate machine learning.

Domain Adaptation in Sentiment Research

Most text-level sentiment classifiers use standard machine learning techniques to learn and select features from labeled corpora.

Factors Affecting System Performance

Comparing system performance across different domains involves a number of factors that can significantly affect the outcome: training set size, level of analysis (sentence or entire document), document domain, genre, and many others.

Experiments

4.1 System Performance on Texts vs. Sentences

Lexicon-Based Approach

The search for a base learner that can produce the greatest synergies with a classifier trained on small-set in-domain data has turned our attention to lexicon-based systems.

Integrating the Corpus-based and Dictionary-based Approaches

The strategy of integrating two or more systems into a single ensemble of classifiers has been actively used for a variety of NLP tasks.

Discussion

The development of domain-independent sentiment determination systems poses a substantial challenge for researchers in NLP and artificial intelligence.

Conclusion

This study contributes to the research on sentiment tagging, domain adaptation, and the development of ensembles of classifiers (1) by proposing a novel approach for sentiment determination at the sentence level and delineating the conditions under which the greatest synergies among combined classifiers can be achieved, (2) by describing a precision-based technique for assigning differential weights to classifier results on the different categories identified by the classifier (i.e., positive vs. negative sentences), and (3) by proposing a new method for sentiment annotation in situations where the annotated in-domain data is too scarce to ensure adequate performance of the corpus-based classifier, which remains the preferred choice when large volumes of annotated data are available for system training.
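
Contribution (2), precision-based vote weighting, is straightforward to express in code. The paper publishes no implementation, so the following Python sketch is only an illustration under stated assumptions: each base classifier casts a vote for a category, and that vote is weighted by the classifier's precision on that category, estimated on held-out data. All names and numbers are hypothetical.

    # Precision-based vote weighting for a two-classifier ensemble
    # (one corpus-based, one lexicon-based). Illustrative sketch only.

    def precision_weighted_vote(sentence, classifiers, precisions):
        """classifiers: callables mapping a sentence to 'pos' or 'neg';
        precisions: dict (classifier_index, category) -> precision
        estimated on held-out data."""
        scores = {"pos": 0.0, "neg": 0.0}
        for i, clf in enumerate(classifiers):
            label = clf(sentence)
            scores[label] += precisions[(i, label)]  # vote weighted by precision
        return max(scores, key=scores.get)

    # Stand-in base classifiers and hypothetical precision estimates:
    lexicon_clf = lambda s: "pos" if "good" in s else "neg"
    corpus_clf = lambda s: "neg" if "bad" in s else "pos"
    precisions = {(0, "pos"): 0.61, (0, "neg"): 0.64,
                  (1, "pos"): 0.70, (1, "neg"): 0.66}
    print(precision_weighted_vote("a good film", [lexicon_clf, corpus_clf], precisions))

The intuition behind weighting by per-category precision is that a classifier that is rarely wrong when it predicts "negative" should count for more on its negative votes than its overall accuracy alone would suggest.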

Topics

in-domain

Appears in 26 sentences as: in-domain (28)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. This study presents a novel approach to the problem of system portability across different domains: a sentiment annotation system that integrates a corpus-based classifier trained on a small set of annotated in-domain data and a lexicon-based system trained on WordNet.
    Page 1, “Abstract”
  2. The paper explores the challenges of system portability across domains and text genres (movie reviews, news, blogs, and product reviews), highlights the factors affecting system performance on out-of-domain and small-set in-domain data, and presents a new system consisting of the ensemble of two classifiers with precision-based vote weighting, that provides significant gains in accuracy and recall over the corpus-based classifier and the lexicon-based system taken individually.
    Page 1, “Abstract”
  3. Many applications require reliable processing of heterogeneous corpora, such as the World Wide Web, where the diversity of genres and domains present in the Internet limits the feasibility of in-domain training.
    Page 1, “Introduction”
  4. A number of methods have been proposed in order to overcome this system portability limitation by using out-of-domain data, unlabelled in-domain corpora, or a combination of in-domain and out-of-domain examples (Aue and Gamon, 2005; Bai et al., 2005; Drezde et al., 2007; Tan et al., 2007).
    Page 1, “Introduction”
  5. The information contained in lexicographical sources, such as WordNet, reflects a lay person’s general knowledge about the world, while domain-specific knowledge can be acquired through classifier training on a small set of in-domain data.
    Page 1, “Introduction”
  6. The final, third part of the paper presents our system, composed of an ensemble of two classifiers: one trained on WordNet glosses and synsets and the other trained on a small in-domain training set.
    Page 1, “Introduction”
  7. There are two alternatives to supervised machine learning that can be used to get around this problem: on the one hand, general lists of sentiment clues/features can be acquired from domain-independent sources such as dictionaries or the Internet; on the other hand, unsupervised and weakly-supervised approaches can be used to take advantage of a small number of annotated in-domain examples and/or of unlabelled in-domain data.
    Page 2, “Domain Adaptation in Sentiment Research”
  8. But such general word lists were shown to perform worse than statistical models built on sufficiently large in-domain training sets of movie reviews (Pang et al., 2002).
    Page 2, “Domain Adaptation in Sentiment Research”
  9. For instance, Aue and Gamon (2005) proposed training on a small number of labeled examples and large quantities of unlabelled in-domain data.
    Page 2, “Domain Adaptation in Sentiment Research”
  10. This system performed well even when compared to systems trained on a large set of in-domain examples: on feedback messages from a web survey on knowledge bases, Aue and Gamon report 73.86% accuracy using unlabelled data compared to 77.34% for
    Page 2, “Domain Adaptation in Sentiment Research”
  11. in-domain and 72.39% for the best out-of-domain training on a large training set.
    Page 2, “Domain Adaptation in Sentiment Research”


unigrams

Appears in 14 sentences as: unigram (2) unigrams (13)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. System runs with unigrams, bigrams, and trigrams as features and with different training set sizes are presented.
    Page 4, “Factors Affecting System Performance”
  2. Consistent with findings in the literature (Cui et al., 2006; Dave et al., 2003; Gamon and Aue, 2005), on the large corpus of movie review texts, the in-domain-trained system based solely on unigrams had lower accuracy than the similar system trained on bigrams.
    Page 4, “Experiments”
  3. On sentences, however, we have observed an inverse pattern: unigrams performed better than bigrams and trigrams.
    Page 4, “Experiments”
  4. Due to the lower frequency of higher-order n-grams (as opposed to unigrams), higher-order n-gram language models are more sparse, which increases the probability of missing a particular sentiment marker in a sentence (Table 3). (A sketch of this comparison appears after this list.)
    Page 4, “Experiments”
  5. Table 3 data, reflowed (accuracy in %; "features" = number of features):
                          Movie   News    Blogs   PRs
     Dataset size         1066    800     800     1200
     unigrams  SVM        68.5    61.5    63.85   76.9
               NB         60.2    59.5    60.5    74.25
               features   5410    4544    3615    2832
     bigrams   SVM        59.9    63.2    61.5    75.9
               NB         57.0    58.4    59.5    67.8
               features   16286   14633   15182   12951
     trigrams  SVM        54.3    55.4    52.7    64.4
               NB         53.3    57.0    56.0    69.7
               features   20837   18738   19847   19132
    Page 5, “Experiments”
  6. Table 3: Accuracy of unigram, bigram and trigram models across domains.
    Page 5, “Experiments”
  7. Table 4: Accuracy of SVM with unigram model
    Page 5, “Experiments”
  8. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
    Page 5, “Experiments”
  9. It is interesting to note that on sentences, regardless of the domain used in system training and regardless of the domain used in system testing, unigrams tend to perform better than higher-order n-grams.
    Page 5, “Experiments”
  10. One of the limitations of general lexicons and dictionaries, such as WordNet (Fellbaum, 1998), as training sets for sentiment tagging systems is that they contain only definitions of individual words and, hence, only unigrams could be effectively learned from dictionary entries.
    Page 5, “Lexicon-Based Approach”
  11. Since the structure of WordNet glosses is fairly different from that of other types of corpora, we developed a system that used the list of human-annotated adjectives from (Hatzivassiloglou and McKeown, 1997) as a seed list and then learned additional unigrams
    Page 5, “Lexicon-Based Approach”
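
The unigram-versus-higher-order n-gram comparison running through items 1-9 above can be sketched with a modern toolkit. The paper predates scikit-learn and does not specify its feature pipeline, so this is an assumption-laden illustration, not a reproduction: binary n-gram presence features feed a linear SVM, and accuracy is cross-validated.

    # Hypothetical reconstruction of the n-gram order comparison with
    # scikit-learn; `sentences` and `labels` are assumed to be a list of
    # strings and a parallel list of 'pos'/'neg' tags.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def ngram_accuracy(sentences, labels, n):
        """Mean 5-fold accuracy of a linear SVM over order-n n-grams only."""
        model = make_pipeline(
            CountVectorizer(ngram_range=(n, n), binary=True),
            LinearSVC())
        return cross_val_score(model, sentences, labels, cv=5).mean()

    # On short sentences the higher-order models are far sparser, so
    # accuracy typically falls from n=1 to n=3, as in Table 3:
    # for n in (1, 2, 3):
    #     print(n, ngram_accuracy(sentences, labels, n))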


SVM

Appears in 10 sentences as: SVM (12)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. They applied an out-of-domain-trained SVM classifier to label examples from the target domain and then retrained the classifier using these new examples.
    Page 2, “Domain Adaptation in Sentiment Research”
  2. Depending on the similarity between domains, this method brought up to a 15% gain compared to the baseline SVM.
    Page 2, “Domain Adaptation in Sentiment Research”
  3. To our knowledge, the only work that describes the application of statistical classifiers (SVM) to sentence-level sentiment classification is (Gamon and Aue, 2005).
    Page 4, “Factors Affecting System Performance”
  4. Table 3 data, reflowed (accuracy in %; "features" = number of features):
                          Movie   News    Blogs   PRs
     Dataset size         1066    800     800     1200
     unigrams  SVM        68.5    61.5    63.85   76.9
               NB         60.2    59.5    60.5    74.25
               features   5410    4544    3615    2832
     bigrams   SVM        59.9    63.2    61.5    75.9
               NB         57.0    58.4    59.5    67.8
               features   16286   14633   15182   12951
     trigrams  SVM        54.3    55.4    52.7    64.4
               NB         53.3    57.0    56.0    69.7
               features   20837   18738   19847   19132
    Page 5, “Experiments”
  5. Table 4: Accuracy of SVM with unigram model
    Page 5, “Experiments”
  6. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
    Page 5, “Experiments”
  7. The baseline performance of the Lexicon-Based System (LBS) described above is presented in Table 5, along with the performance results of the in-domain- and out-of-domain-trained SVM classifier.
    Page 6, “Lexicon-Based Approach”
  8. Table 5 fragment: columns Movies | News | Blogs | PRs; LBS 57.5 | 62.3 | 63.3 | 59.3; SVM in-dom.
    Page 6, “Lexicon-Based Approach”
  9. Table 5 fragment (cont.): SVM in-dom. 68.5 | 61.5 | 63.85 | 76.9; SVM out-of-dom.
    Page 6, “Lexicon-Based Approach”
  10. Then, using an SVM meta-classifier trained on a small number of target-domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data. (A stacking sketch appears after this list.)
    Page 6, “Integrating the Corpus-based and Dictionary-based Approaches”
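
Item 10 describes a stacked ensemble: base SVMs trained out of domain, combined by an SVM meta-classifier fitted on a few target-domain examples. Below is a minimal sketch of that stacking step, assuming the base classifiers are already-fitted scikit-learn pipelines that accept raw text; all names are illustrative, not Aue and Gamon's code.

    # Stacking sketch: each base classifier contributes one feature
    # (its signed decision value), and a meta-SVM learns to combine them.
    import numpy as np
    from sklearn.svm import LinearSVC

    def stack_features(base_pipelines, texts):
        # One column per base classifier, one row per text.
        return np.column_stack([p.decision_function(texts) for p in base_pipelines])

    # Fit the meta-classifier on a small labelled target-domain sample:
    # meta = LinearSVC().fit(stack_features(base_pipelines, target_texts),
    #                        target_labels)
    # predicted = meta.predict(stack_features(base_pipelines, new_texts))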


WordNet

Appears in 9 sentences as: WordNet (9)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. This study presents a novel approach to the problem of system portability across different domains: a sentiment annotation system that integrates a corpus-based classifier trained on a small set of annotated in-domain data and a lexicon-based system trained on WordNet.
    Page 1, “Abstract”
  2. In this paper, we present a novel approach to the problem of system portability across different domains by developing a sentiment annotation system that integrates a corpus-based classifier with a lexicon-based system trained on WordNet.
    Page 1, “Introduction”
  3. The information contained in lexicographical sources, such as WordNet, reflects a lay person’s general knowledge about the world, while domain-specific knowledge can be acquired through classifier training on a small set of in-domain data.
    Page 1, “Introduction”
  4. The final, third part of the paper presents our system, composed of an ensemble of two classifiers: one trained on WordNet glosses and synsets and the other trained on a small in-domain training set.
    Page 1, “Introduction”
  5. A lexicon-based approach capitalizes on the fact that dictionaries, such as WordNet (Fellbaum, 1998), contain a comprehensive and domain-independent set of sentiment clues that exist in general English.
    Page 5, “Lexicon-Based Approach”
  6. One of the limitations of general lexicons and dictionaries, such as WordNet (Fellbaum, 1998), as training sets for sentiment tagging systems is that they contain only definitions of individual words and, hence, only unigrams could be effectively learned from dictionary entries.
    Page 5, “Lexicon-Based Approach”
  7. Since the structure of WordNet glosses is fairly different from that of other types of corpora, we developed a system that used the list of human-annotated adjectives from (Hatzivassiloglou and McKeown, 1997) as a seed list and then learned additional unigrams
    Page 5, “Lexicon-Based Approach”
  8. from WordNet synsets and glosses with up to 88% accuracy, when evaluated against General Inquirer (Stone et al., 1966) (GI) on the intersection of our automatically acquired list with GI. (A simplified sketch of this bootstrapping idea appears after this list.)
    Page 6, “Lexicon-Based Approach”
  9. The resulting measure, termed Net Overlap Score (NOS), reflected the number of ties linking a given word with other sentiment-laden words in WordNet, and hence could be used as a measure of the word’s centrality in the fuzzy category of sentiment.
    Page 6, “Lexicon-Based Approach”
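
Items 7-9 describe bootstrapping a sentiment lexicon from seed adjectives through WordNet. The sketch below is a deliberately simplified, hypothetical version of that idea using NLTK's WordNet interface: polarity propagates unchanged to synonyms and flips across antonyms. The authors' actual procedure, 58 runs over non-intersecting seed lists with learning from glosses as well as synsets, is not reproduced here.

    # Seed-list expansion over WordNet adjective synsets (illustrative).
    # Requires: nltk, plus the 'wordnet' corpus (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def expand_seeds(seeds, passes=2):
        """seeds: dict word -> +1 (positive) or -1 (negative).
        Returns an expanded word -> polarity dictionary."""
        lexicon = dict(seeds)
        frontier = dict(seeds)
        for _ in range(passes):
            new = {}
            for word, polarity in frontier.items():
                for synset in wn.synsets(word, pos=wn.ADJ):
                    for lemma in synset.lemmas():
                        new.setdefault(lemma.name(), polarity)     # synonym: same polarity
                        for ant in lemma.antonyms():
                            new.setdefault(ant.name(), -polarity)  # antonym: flipped
            frontier = {w: p for w, p in new.items() if w not in lexicon}
            lexicon.update(frontier)
        return lexicon

    print(len(expand_seeds({"good": +1, "bad": -1})))

A count of how many independent runs reach a word, which is roughly the role the Net Overlap Score plays in the paper, could then serve as a centrality weight; that step is omitted here.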


bigrams

Appears in 8 sentences as: bigram (1) bigrams (7)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. System runs with unigrams, bigrams, and trigrams as features and with different training set sizes are presented.
    Page 4, “Factors Affecting System Performance”
  2. Consistent with findings in the literature (Cui et al., 2006; Dave et al., 2003; Gamon and Aue, 2005), on the large corpus of movie review texts, the in-domain-trained system based solely on unigrams had lower accuracy than the similar system trained on bigrams.
    Page 4, “Experiments”
  3. But the trigrams fared slightly worse than bigrams.
    Page 4, “Experiments”
  4. On sentences, however, we have observed an inverse pattern: unigrams performed better than bigrams and trigrams.
    Page 4, “Experiments”
  5. Table 3 data, reflowed (accuracy in %; "features" = number of features):
                          Movie   News    Blogs   PRs
     Dataset size         1066    800     800     1200
     unigrams  SVM        68.5    61.5    63.85   76.9
               NB         60.2    59.5    60.5    74.25
               features   5410    4544    3615    2832
     bigrams   SVM        59.9    63.2    61.5    75.9
               NB         57.0    58.4    59.5    67.8
               features   16286   14633   15182   12951
     trigrams  SVM        54.3    55.4    52.7    64.4
               NB         53.3    57.0    56.0    69.7
               features   20837   18738   19847   19132
    Page 5, “Experiments”
  6. Table 3: Accuracy of unigram, bigram and trigram models across domains.
    Page 5, “Experiments”
  7. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
    Page 5, “Experiments”
  8. In the ensemble of classifiers, they used a combination of nine SVM-based classifiers deployed to learn unigrams, bigrams, and trigrams on three different domains, while the fourth domain was used as an evaluation set.
    Page 6, “Integrating the Corpus-based and Dictionary-based Approaches”


machine learning

Appears in 7 sentences as: machine learning (8)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. One of the emerging directions in NLP is the development of machine learning methods that perform well not only on the domain on which they were trained, but also on other domains, for which training data is not available or is not sufficient to ensure adequate machine learning.
    Page 1, “Introduction”
  2. Most text-level sentiment classifiers use standard machine learning techniques to learn and select features from labeled corpora.
    Page 2, “Domain Adaptation in Sentiment Research”
  3. There are two alternatives to supervised machine learning that can be used to get around this problem: on the one hand, general lists of sentiment clues/features can be acquired from domain-independent sources such as dictionaries or the Internet; on the other hand, unsupervised and weakly-supervised approaches can be used to take advantage of a small number of annotated in-domain examples and/or of unlabelled in-domain data.
    Page 2, “Domain Adaptation in Sentiment Research”
  4. On other domains, such as product reviews, the performance of systems that use general word lists is comparable to the performance of supervised machine learning approaches (Gamon and Aue, 2005).
    Page 2, “Domain Adaptation in Sentiment Research”
  5. The recognition of major performance deficiencies of supervised machine learning methods with insufficient or out-of-domain training brought about an increased interest in unsupervised and weakly-supervised approaches to feature learning.
    Page 2, “Domain Adaptation in Sentiment Research”
  6. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
    Page 5, “Experiments”
  7. For this reason, the numbers reported for the corpus-based classifier do not reflect the full potential of machine learning approaches when sufficient in-domain training data is available.
    Page 7, “Integrating the Corpus-based and Dictionary-based Approaches”


statistically significant

Appears in 6 sentences as: statistical significance (1) statistically significant (9)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. All results are statistically significant at α = 0.01, with two exceptions: the difference between trigrams and bigrams for the system trained and tested on texts is statistically significant at α = 0.1, and for the system trained on sentences and tested on texts it is not statistically significant at α = 0.01.
    Page 4, “Experiments”
  2. The statistical significance of the
    Page 4, “Experiments”
  3. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025). (An illustrative significance test is sketched after this list.)
    Page 5, “Experiments”
  4. Then, using an SVM meta-classifier trained on a small number of target-domain examples to combine the nine base classifiers, they obtained a statistically significant improvement on out-of-domain texts from book reviews, knowledge-base feedback, and product support services survey data.
    Page 6, “Integrating the Corpus-based and Dictionary-based Approaches”
  5. The results reported in Table 6 are statistically significant at α = 0.01.
    Page 7, “Integrating the Corpus-based and Dictionary-based Approaches”
  6. are statistically significant at α = 0.01, except the runs on movie reviews where the difference between the LBS and Ensemble classifiers was significant at α = 0.05.
    Page 8, “Integrating the Corpus-based and Dictionary-based Approaches”
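
The paper does not name the test behind the α values quoted above (see item 3). For two classifiers evaluated on the same test set, an exact McNemar (sign) test on their disagreements is a common choice; the following Python sketch is illustrative only, not the authors' procedure.

    # Exact McNemar test: consider only items where exactly one of the
    # two classifiers is correct, and test whether that split is fair.
    from scipy.stats import binomtest

    def mcnemar_exact(preds_a, preds_b, gold):
        """p-value for H0: classifiers A and B have equal error rates."""
        a_only = sum(pa == g != pb for pa, pb, g in zip(preds_a, preds_b, gold))
        b_only = sum(pb == g != pa for pa, pb, g in zip(preds_a, preds_b, gold))
        n = a_only + b_only
        return 1.0 if n == 0 else binomtest(a_only, n, 0.5).pvalue

    # A difference is declared significant at level alpha when the
    # returned p-value falls below alpha (e.g., 0.01 or 0.025 above).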


sentence-level

Appears in 5 sentences as: sentence-level (5)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. To our knowledge, the only work that describes the application of statistical classifiers (SVM) to sentence-level sentiment classification is (Gamon and Aue, 2005).
    Page 4, “Factors Affecting System Performance”
  2. These results highlight a special property of sentence-level annotation: greater sensitivity to sparseness of the model. On texts, classifier error on one particular sentiment marker is often compensated for by a number of correctly identified other sentiment clues.
    Page 4, “Experiments”
  3. Since sentences usually contain a much smaller number of sentiment clues than texts, sentence-level annotation more readily yields errors when a single sentiment clue is incorrectly identified or missed by the system.
    Page 4, “Experiments”
  4. training sets are required to overcome this higher n-gram sparseness in sentence-level annotation.
    Page 5, “Experiments”
  5. This observation suggests that, given the constraints on the size of the available training sets, unigram-based systems may be better suited for sentence-level sentiment annotation.
    Page 5, “Experiments”


domain adaptation

Appears in 4 sentences as: domain adaptation (4)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. The first part of this paper reviews the extant literature on domain adaptation in sentiment analysis and highlights promising directions for research.
    Page 1, “Introduction”
  2. (2007) applied structural correspondence learning (Drezde et al., 2007) to the task of domain adaptation for sentiment classification of product reviews.
    Page 2, “Domain Adaptation in Sentiment Research”
  3. In sentiment tagging and related areas, Aue and Gamon (2005) demonstrated that combining classifiers can be a valuable tool in domain adaptation for sentiment analysis.
    Page 6, “Integrating the Corpus-based and Dictionary-based Approaches”
  4. This study contributes to the research on sentiment tagging, domain adaptation, and the development of ensembles of classifiers (1) by proposing a novel approach for sentiment determination at the sentence level and delineating the conditions under which the greatest synergies among combined classifiers can be achieved, (2) by describing a precision-based technique for assigning differential weights to classifier results on the different categories identified by the classifier (i.e., positive vs. negative sentences), and (3) by proposing a new method for sentiment annotation in situations where the annotated in-domain data is too scarce to ensure adequate performance of the corpus-based classifier, which remains the preferred choice when large volumes of annotated data are available for system training.
    Page 8, “Conclusion”


manually annotated

Appears in 3 sentences as: manually annotated (3)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. A balanced corpus of 800 manually annotated sentences extracted from 83 newspaper texts
    Page 3, “Factors Affecting System Performance”
  2. 200 sentences (100 positive and 100 negative) were also randomly selected from this corpus for an inter-annotator agreement study and were manually annotated by two independent annotators. (A kappa sketch appears after this list.)
    Page 3, “Factors Affecting System Performance”
  3. In order to assign the membership score to each word, we did 58 system runs on unique, non-intersecting seed lists drawn from the manually annotated list of positive and negative adjectives from (Hatzivassiloglou and McKeown, 1997).
    Page 6, “Lexicon-Based Approach”
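
Item 2 mentions an inter-annotator agreement study on 200 doubly annotated sentences, but the paper gives no agreement code. Cohen's kappa is the standard measure for two annotators; the sketch below, with made-up labels, shows the computation.

    # Cohen's kappa: observed agreement corrected for chance agreement.
    from collections import Counter

    def cohens_kappa(ann1, ann2):
        n = len(ann1)
        observed = sum(a == b for a, b in zip(ann1, ann2)) / n
        c1, c2 = Counter(ann1), Counter(ann2)
        expected = sum(c1[k] * c2[k] for k in c1) / (n * n)  # chance agreement
        return (observed - expected) / (1 - expected)

    print(cohens_kappa(["pos", "neg", "pos", "pos"],
                       ["pos", "neg", "neg", "pos"]))  # -> 0.5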


n-gram

Appears in 3 sentences as: n-gram (3)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. Due to the lower frequency of higher-order n-grams (as opposed to unigrams), higher-order n-gram language models are more sparse, which increases the probability of missing a particular sentiment marker in a sentence (Table 3).
    Page 4, “Experiments”
  2. training sets are required to overcome this higher n-gram sparseness in sentence-level annotation.
    Page 5, “Experiments”
  3. results depends on the genre and the size of the n-gram: on product reviews, all results are statistically significant at the α = 0.025 level; on movie reviews, the difference between Naïve Bayes and SVM is statistically significant at α = 0.01, but the significance diminishes as the size of the n-gram increases; on news, only bigrams produce a statistically significant (α = 0.01) difference between the two machine learning methods, while on blogs the difference between SVMs and Naïve Bayes is most pronounced when unigrams are used (α = 0.025).
    Page 5, “Experiments”


sentiment classification

Appears in 3 sentences as: sentiment classification (2) sentiment classifiers (1)
In When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging
  1. Most text-level sentiment classifiers use standard machine learning techniques to learn and select features from labeled corpora.
    Page 2, “Domain Adaptation in Sentiment Research”
  2. (2007) applied structural correspondence learning (Drezde et al., 2007) to the task of domain adaptation for sentiment classification of product reviews.
    Page 2, “Domain Adaptation in Sentiment Research”
  3. To our knowledge, the only work that describes the application of statistical classifiers (SVM) to sentence-level sentiment classification is (Gamon and Aue, 2005).
    Page 4, “Factors Affecting System Performance”
