The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
Kashyap Popat, Balamurali A.R., Pushpak Bhattacharyya and Gholamreza Haffari

Article Structure

Abstract

Expensive feature engineering based on WordNet senses has been shown to be useful for document level sentiment classification.

Introduction

Data sparsity is the bane of Natural Language Processing (NLP) (Xue et al., 2005; Minkov et al., 2007).

Related Work

The problem of SA at document level is defined as the classification of a document into different polarity classes (positive and negative) (Turney, 2002).

Clustering for Sentiment Analysis

The goal of this paper, to remind the reader, is to investigate whether superior word cluster features based on a manually crafted and fine-grained lexical resource like WordNet can be replaced with syntagmatic property based word clusters created from unlabelled monolingual corpora.
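
A minimal sketch of this cluster-feature idea follows: words are mapped to cluster identifiers learnt from unlabelled monolingual corpora, and a document is then represented as a bag of those identifiers. The Brown-style cluster-file layout (bit-path, word, count per line) and all file names are assumptions for illustration, not the authors' exact setup.

    # Sketch: represent a document by cluster identifiers instead of surface
    # words, so that different words in the same cluster share one feature.
    # Assumes a Brown-style cluster file with tab-separated lines:
    #   <bit-path>\t<word>\t<count>
    from collections import Counter

    def load_clusters(path):
        """Map each word to its cluster identifier (the bit-string path)."""
        word2cluster = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                bits, word, _count = line.rstrip("\n").split("\t")
                word2cluster[word] = bits
        return word2cluster

    def cluster_features(tokens, word2cluster):
        """Bag-of-cluster-ID features for one tokenised document; unknown
        words are kept as surface forms rather than silently dropped."""
        return Counter(word2cluster.get(tok, tok) for tok in tokens)

    # Hypothetical usage:
    # w2c = load_clusters("clusters-en.txt")
    # feats = cluster_features("the camera is excellent".split(), w2c)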

Clustering for Cross Lingual Sentiment Analysis

Existing approaches for CLSA depend on an intermediary machine translation system to bridge the language gap (Hiroshi et al., 2004; Banea et al., 2008).
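
The paper's alternative to an MT bridge is to link clusters across languages using word alignments obtained from an unlabelled parallel corpus. The exact linking procedures (direct cluster linking and cross-lingual clustering, section 4) are not reproduced in this index; below is a hedged sketch of one plausible linking rule, a majority vote over aligned word pairs, with all names being illustrative.

    # Hedged sketch of linking source-language clusters to target-language
    # clusters via word alignments (majority vote). This is an assumed,
    # simplified rule, not necessarily the paper's exact procedure.
    from collections import Counter, defaultdict

    def link_clusters(aligned_pairs, src_w2c, tgt_w2c):
        """aligned_pairs: iterable of (source_word, target_word) alignments.
        src_w2c / tgt_w2c: word -> cluster-id maps for the two languages."""
        votes = defaultdict(Counter)
        for s_word, t_word in aligned_pairs:
            if s_word in src_w2c and t_word in tgt_w2c:
                votes[src_w2c[s_word]][tgt_w2c[t_word]] += 1
        # for every source cluster, keep the most frequently aligned target cluster
        return {s_cl: c.most_common(1)[0][0] for s_cl, c in votes.items()}

A source-language document expressed as cluster identifiers can then be rewritten into target-language cluster identifiers before training or testing a classifier.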

Experimental Setup

Analysis was performed on three languages, viz., English (En), Hindi (Hi) and Marathi (Mar).

Results

Monolingual classification results are shown in Table 2.

Discussions

In this section, some important observations from the results are discussed.

Conclusion and Future Work

This paper explored the feasibility of using word cluster based features in lieu of features based on WordNet senses for sentiment analysis to alleviate the problem of data sparsity.

Topics

WordNet

Appears in 27 sentences as: WordNet (23) WordNets (4)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. Expensive feature engineering based on WordNet senses has been shown to be useful for document level sentiment classification.
    Page 1, “Abstract”
  2. WordNet is a byproduct of such an analysis.
    Page 1, “Introduction”
  3. In WordNet, paradigms are manually generated based on the principles of lexical and semantic relationship among words (Fellbaum, 1998).
    Page 1, “Introduction”
  4. WordNets are primarily used to address the problem of word sense disambiguation.
    Page 1, “Introduction”
  5. However, at present there are many NLP applications which use WordNet.
    Page 1, “Introduction”
  6. As WordNets are essentially word clusters wherein words with the same meaning are clubbed together, they address the problem of data sparsity at word level.
    Page 1, “Introduction”
  7. The abstraction and dimensionality reduction thus achieved contribute to the superior performance of SA systems that employ WordNet senses as features.
    Page 2, “Introduction”
  8. However, WordNets are manually created.
    Page 2, “Introduction”
  9. In the case of SA, manually creating features based on WordNet senses is a tedious and expensive process.
    Page 2, “Introduction”
  10. Moreover, WordNets are not present for many languages.
    Page 2, “Introduction”
  11. All these factors make the paradigmatic property based cluster features like WordNet senses a less promising pursuit for SA.
    Page 2, “Introduction”

cross-lingual

Appears in 15 sentences as: Cross-Lingual (2) Cross-lingual (4) cross-lingual (11)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. In this paper, the problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through the means of clustering.
    Page 1, “Abstract”
  2. Popular approaches for Cross-Lingual Sentiment Analysis (CLSA) (Wan, 2009; Duh et al., 2011) depend on Machine Translation (MT) for converting the labeled data from one language to the other (Hiroshi et al., 2004; Banea et al., 2008; Wan, 2009).
    Page 2, “Introduction”
  3. Instead, the language gap for performing CLSA is bridged using linked clusters or cross-lingual clusters (explained in section 4) with the help of unlabelled monolingual corpora.
    Page 2, “Introduction”
  4. In situations where labeled data is not present in a language, approaches based on cross-lingual sentiment analysis are used.
    Page 3, “Related Work”
  5. 4.3 Approach 3: Cross-Lingual Clustering (XC)
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  6. (2012) introduced cross-lingual clustering.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  7. In cross-lingual clustering, the objective function maximizes the joint likelihood of monolingual and cross-lingual factors.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  8. Whereas, in the case of cross-lingual clustering, the same clustering can be explained in terms of maximizing the likelihood of monolingual word-cluster pairs of the source, the target and the alignments between them.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  9. Algorithm 2 Cross-lingual Clustering (XC)
     Input: Source and target language corpus
     Output: Cross-lingual clusters
    Page 5, “Clustering for Cross Lingual Sentiment Analysis”
  10. Cross-lingual clustering for CLSA
    Page 5, “Experimental Setup”
  11. Cross-lingual SA accuracies are presented in Table 3.
    Page 6, “Results”
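
Items 7 and 8 above describe the cross-lingual clustering objective only in words, and the equation referred to later under "word alignments" is not reproduced in this index. As a hedged sketch, one plausible form of such a joint objective, in the spirit of Täckström et al. (2012), is:

    % Hedged sketch only; the paper's exact factorisation may differ.
    \begin{align}
      J(C_S, C_T) &= \underbrace{L_S(C_S) + L_T(C_T)}_{\text{monolingual factors}}
                   + \underbrace{L_{T|S}(C_S, C_T) + L_{S|T}(C_S, C_T)}_{\text{cross-lingual factors}} \\
      L_{T|S}(C_S, C_T) &= \sum_{(s,t) \in A} \log p\big( C_T(t) \mid C_S(s) \big)
    \end{align}

where A is the set of aligned word pairs from the parallel corpus, C_S(s) is the cluster assigned to source word s, C_T(t) the cluster assigned to target word t, and L_{S|T} is defined symmetrically.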

data sparsity

Appears in 13 sentences as: Data sparsity (1) data sparsity (12)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. A plausible reason for such a performance improvement is the reduction in data sparsity.
    Page 1, “Abstract”
  2. In this paper, the problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through the means of clustering.
    Page 1, “Abstract”
  3. Experiments show that cluster based data sparsity reduction leads to performance better than sense based classification for sentiment analysis at document level.
    Page 1, “Abstract”
  4. A similar idea is applied to Cross Lingual Sentiment Analysis (CLSA), and it is shown that reduction in data sparsity (after translation or bilingual-mapping) produces accuracy higher than Machine Translation based CLSA and sense based CLSA.
    Page 1, “Abstract”
  5. Data sparsity is the bane of Natural Language Processing (NLP) (Xue et al., 2005; Minkov et al., 2007).
    Page 1, “Introduction”
  6. NLP applications innovatively handle data sparsity through various means.
    Page 1, “Introduction”
  7. A special, but very common, kind of data sparsity, viz. word sparsity, can be addressed in one of two obvious ways: 1) sparsity reduction through paradigmatically related words or 2) sparsity reduction through syntagmatically related words.
    Page 1, “Introduction”
  8. In this paper, the focus is on alleviating the data sparsity faced by supervised approaches for SA through the means of cluster based features.
    Page 1, “Introduction”
  9. As WordNets are essentially word clusters wherein words with the same meaning are clubbed together, they address the problem of data sparsity at word level.
    Page 2, “Introduction”
  10. In the current work, this particular insight is used to solve the data sparsity problem in sentiment analysis by leveraging unlabelled monolingual corpora.
    Page 2, “Introduction”
  11. Section 3 explains different word cluster based features employed to reduce data sparsity for monolingual SA.
    Page 2, “Introduction”
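
To make the sparsity argument concrete, one can measure the fraction of test-time features that were never seen during training, before and after mapping words to clusters; the percentages quoted later under "synset" (34.17%, 11.24%, 0.31% on En-PD) are of this kind. The sketch below is illustrative only and is not the paper's evaluation code.

    # Hedged sketch: fraction of test-set features unseen in training,
    # for raw words vs. cluster identifiers. Illustrative only.
    def unseen_feature_rate(train_docs, test_docs, to_feature=lambda w: w):
        """train_docs / test_docs: lists of token lists."""
        train_feats = {to_feature(w) for doc in train_docs for w in doc}
        test_feats = [to_feature(w) for doc in test_docs for w in doc]
        unseen = sum(1 for f in test_feats if f not in train_feats)
        return unseen / max(len(test_feats), 1)

    # Hypothetical usage, with word2cluster taken from a clustering run:
    # rate_words    = unseen_feature_rate(train, test)
    # rate_clusters = unseen_feature_rate(train, test,
    #                                     lambda w: word2cluster.get(w, w))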

sentiment analysis

Appears in 12 sentences as: Sentiment Analysis (3) sentiment analysis (9)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. In this paper, the problem of data sparsity in sentiment analysis, both monolingual and cross-lingual, is addressed through the means of clustering.
    Page 1, “Abstract”
  2. Experiments show that cluster based data sparsity reduction leads to performance better than sense based classification for sentiment analysis at document level.
    Page 1, “Abstract”
  3. A similar idea is applied to Cross Lingual Sentiment Analysis (CLSA), and it is shown that reduction in data sparsity (after translation or bilingual-mapping) produces accuracy higher than Machine Translation based CLSA and sense based CLSA.
    Page 1, “Abstract”
  4. One such application is Sentiment Analysis (SA) (Pang and Lee, 2002).
    Page 1, “Introduction”
  5. In the current work, this particular insight is used to solve the data sparsity problem in sentiment analysis by leveraging unlabelled monolingual corpora.
    Page 2, “Introduction”
  6. Popular approaches for Cross-Lingual Sentiment Analysis (CLSA) (Wan, 2009; Duh et al., 2011) depend on Machine Translation (MT) for converting the labeled data from one language to the other (Hiroshi et al., 2004; Banea et al., 2008; Wan, 2009).
    Page 2, “Introduction”
  7. There has been research related to clustering and sentiment analysis.
    Page 3, “Related Work”
  8. (2011) attempts to cluster features of a product to perform sentiment analysis on product reviews.
    Page 3, “Related Work”
  9. In situations where labeled data is not present in a language, approaches based on cross-lingual sentiment analysis are used.
    Page 3, “Related Work”
  10. Given that sentiment analysis is a less resource intensive task compared to machine translation, the use of an MT system is hard to justify for performing CLSA.
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  11. The reason for the drop in the accuracy of the approach based on sense features on the En-PD dataset is the domain specific nature of sentiment analysis (Blitzer et al., 2007), which is explained in the next point.
    Page 7, “Discussions”

MT system

Appears in 10 sentences as: MT system (8) MT systems (3)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. However, many languages which are truly resource scarce do not have an MT system, or the existing MT systems are not ripe to be used for CLSA (Balamurali et al., 2013).
    Page 2, “Introduction”
  2. No MT systems or bilingual dictionaries are used for this study.
    Page 2, “Introduction”
  3. Given the subtle and different ways in which sentiment can be expressed, which themselves are a result of cultural diversity amongst different languages, an MT system has to be of superior quality to capture them.
    Page 3, “Related Work”
  4. If a language is truly resource scarce, it is most unlikely to have an MT system.
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  5. Given that sentiment analysis is a less resource intensive task compared to machine translation, the use of an MT system is hard to justify for performing CLSA.
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  6. A note on CLSA for truly resource scarce languages: Note that there is no publicly available MT system for English to Marathi.
    Page 8, “Discussions”
  7. For CLSA, clusters linked together using unlabelled parallel corpora do away with the need of translating labelled corpora from one language to another using an intermediary MT system or bilingual dictionary.
    Page 8, “Conclusion and Future Work”
  8. Further, this approach was found to be useful in cases where there are no MT systems to perform CLSA and the language of analysis is truly resource scarce.
    Page 8, “Conclusion and Future Work”
  9. Thus, a wider implication of this study is that many widely spoken yet resource scarce languages like Pashto, Sundanese, Hausa, Gujarati and Punjabi which do not have an MT system could now be analysed for sentiment.
    Page 8, “Conclusion and Future Work”
  10. for CLSA can be considerably smaller than the size of the parallel corpora required to train an MT system.
    Page 9, “Conclusion and Future Work”

sentiment classification

Appears in 9 sentences as: sentiment classification (4) sentiment classified (1) sentiment classifier (4) sentiment classifiers (1)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. Expensive feature engineering based on WordNet senses has been shown to be useful for document level sentiment classification.
    Page 1, “Abstract”
  2. Word clustering is a powerful mechanism to “transfer” a sentiment classifier from one language to another.
    Page 2, “Introduction”
  3. (2011) showed that WordNet synsets can act as good features for document level sentiment classification.
    Page 3, “Clustering for Sentiment Analysis”
  4. In this study, synset identifiers are extracted from manually/automatically sense annotated corpora and used as features for creating sentiment classifiers .
    Page 3, “Clustering for Sentiment Analysis”
  5. of sentiment classification, cluster identifiers
    Page 3, “Clustering for Sentiment Analysis”
  6. The language whose annotated data is used for training is called the source language (S), while the language whose documents are to be sentiment classified is referred to as the target language (T).
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  7. Algorithm 1 Projection based on sense
     Input: Polarity labeled data in source language (S) and data in target language (T) to be labeled
     Output: Classified documents
     1: Sense mark the polarity labeled data from S
     2: Project the sense marked corpora from S to T using a Multidict
     3: Model the sentiment classifier using the data obtained in step-2
     4: Sense mark the unlabelled data from T
     5: Test the sentiment classifier on data obtained in step-4 using model obtained in step-3
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  8. SVM was used since it is known to perform well for sentiment classification (Pang et al., 2002).
    Page 6, “Experimental Setup”
  9. Whereas, a sentiment classifier using sense (PS) or direct cluster linking (DCL) is not very effective.
    Page 8, “Discussions”
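
Item 8 above notes that an SVM was the classifier of choice. A minimal, hedged sketch of that setup follows, with scikit-learn standing in for whichever SVM implementation the authors actually used, and with each document already rewritten as a string of cluster identifiers.

    # Hedged sketch: linear SVM over bag-of-cluster-identifier features.
    # scikit-learn is an assumption; this index does not name the paper's toolkit.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy data: each document is a whitespace-joined string of cluster IDs,
    # labelled 1 (positive) or 0 (negative).
    train_docs = ["0110 1011 0001", "1100 1011 0111"]
    train_labels = [1, 0]

    clf = make_pipeline(
        CountVectorizer(token_pattern=r"\S+"),  # keep cluster IDs intact
        LinearSVC(),
    )
    clf.fit(train_docs, train_labels)
    print(clf.predict(["0110 0001"]))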

machine translation

Appears in 8 sentences as: Machine Translation (2) Machine translation (1) machine translation (5)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. A similar idea is applied to Cross Lingual Sentiment Analysis (CLSA), and it is shown that reduction in data sparsity (after translation or bilingual-mapping) produces accuracy higher than Machine Translation based CLSA and sense based CLSA.
    Page 1, “Abstract”
  2. When used as an additional feature with word based language models, it has been shown to improve system performance in tasks such as machine translation (Uszkoreit and Brants, 2008; Stymne, 2012), speech recognition (Martin et al., 1995; Samuelsson and Reichl, 1999), dependency parsing (Koo et al., 2008; Haffari et al., 2011; Zhang and Nivre, 2011; Tratz and Hovy, 2011) and NER (Miller et al., 2004; Faruqui and Padó, 2010; Turian et al., 2010; Täckström et al., 2012).
    Page 1, “Introduction”
  3. Popular approaches for Cross-Lingual Sentiment Analysis (CLSA) (Wan, 2009; Duh et al., 2011) depend on Machine Translation (MT) for converting the labeled data from one language to the other (Hiroshi et al., 2004; Banea et al., 2008; Wan, 2009).
    Page 2, “Introduction”
  4. Most often these methods depend on an intermediary machine translation system (Wan, 2009; Brooke et al., 2009) or a bilingual dictionary (Ghorbel and Jacot, 2011; Lu et al., 2011) to bridge the language gap.
    Page 3, “Related Work”
  5. Existing approaches for CLSA depend on an intermediary machine translation system to bridge the language gap (Hiroshi et al., 2004; Banea et al., 2008).
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  6. Machine translation is very resource intensive.
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  7. Given that sentiment analysis is a less resource intensive task compared to machine translation, the use of an MT system is hard to justify for performing CLSA.
    Page 3, “Clustering for Cross Lingual Sentiment Analysis”
  8. This could degrade the accuracy of the machine translation itself, limiting the performance of an MT based CLSA system.
    Page 8, “Discussions”

parallel corpora

Appears in 7 sentences as: parallel corpora (7)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  2. The direct cluster linking approach suffers from the size of the alignment dataset in the form of parallel corpora.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  3. To create alignments, English-Hindi and English-Marathi parallel corpora from ILCI were used.
    Page 5, “Experimental Setup”
  4. For CLSA, clusters linked together using unlabelled parallel corpora do away with the need of translating labelled corpora from one language to another using an intermediary MT system or bilingual dictionary.
    Page 8, “Conclusion and Future Work”
  5. The approach presented here for CLSA will still require parallel corpora.
    Page 8, “Conclusion and Future Work”
  6. However, the size of the parallel corpora required
    Page 8, “Conclusion and Future Work”
  7. for CLSA can be considerably smaller than the size of the parallel corpora required to train an MT system.
    Page 9, “Conclusion and Future Work”

synset

Appears in 7 sentences as: synset (4) synsets (3)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. A synonymous set of words in a WordNet is called a synset.
    Page 3, “Clustering for Sentiment Analysis”
  2. Each synset can be considered a word cluster comprising semantically similar words.
    Page 3, “Clustering for Sentiment Analysis”
  3. (2011) showed that WordNet synsets can act as good features for document level sentiment classification.
    Page 3, “Clustering for Sentiment Analysis”
  4. The results suggested that WordNet synset based features performed better than word-based features.
    Page 3, “Clustering for Sentiment Analysis”
  5. In this study, synset identifiers are extracted from manually/automatically sense annotated corpora and used as features for creating sentiment classifiers.
    Page 3, “Clustering for Sentiment Analysis”
  6. For example, on En-PD, the percentages of features present in the test set but not in the training set, relative to those present in the test set, are 34.17%, 11.24% and 0.31% for words, synsets and clusters respectively.
    Page 7, “Discussions”
  7. However, it must be noted that clustering based on unlabelled corpora is less taxing than manually creating paradigmatic property based clusters like WordNet synsets.
    Page 7, “Discussions”
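
The synset-as-cluster view in items 1 to 5 can be illustrated with NLTK's WordNet interface. The paper relies on manually or automatically sense-annotated corpora; the first-listed (most frequent) sense used below is only a crude stand-in for proper sense disambiguation.

    # Hedged sketch: replace words with WordNet synset identifiers.
    # Requires NLTK with the 'wordnet' corpus downloaded; the first-sense
    # heuristic is a simplification, not the paper's annotation scheme.
    from nltk.corpus import wordnet as wn

    def synset_features(tokens):
        feats = []
        for tok in tokens:
            senses = wn.synsets(tok)
            # fall back to the surface word when WordNet has no entry
            feats.append(senses[0].name() if senses else tok)
        return feats

    # synset_features("the movie was excellent".split())
    # yields identifiers such as 'movie.n.01' in place of the raw words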

word alignments

Appears in 5 sentences as: Word alignments (1) word alignments (4)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. To perform CLSA, this study leverages an unlabelled parallel corpus to generate the word alignments.
    Page 2, “Introduction”
  2. These word alignments are then used to link cluster based features to obliterate the language gap for performing SA.
    Page 2, “Introduction”
  3. Given a parallel bilingual corpus, word clusters in S can be aligned to clusters in T. Word alignments are created using parallel corpora.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  4. Here, L_{T|S}(·) and L_{S|T}(·) are factors based on word alignments, which can be represented as:
    Page 5, “Clustering for Cross Lingual Sentiment Analysis”
  5. A naive cluster linkage algorithm based on word alignments was used to perform CLSA.
    Page 9, “Conclusion and Future Work”
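
Items 1 to 3 above describe creating word alignments from parallel corpora and using them to link clusters. A hedged sketch of the bookkeeping step, turning sentence-level alignments into word-pair counts, is given below; the Pharaoh-style "i-j" alignment format is an assumption (it is the common output of aligners such as GIZA++ or fast_align), since this index does not name the paper's exact toolchain.

    # Hedged sketch: aggregate sentence-level word alignments into counts of
    # (source word, target word) pairs. Assumes Pharaoh-style lines "0-1 2-0 ...".
    from collections import Counter, defaultdict

    def alignment_counts(src_sents, tgt_sents, align_lines):
        counts = defaultdict(Counter)   # source word -> Counter of target words
        for src, tgt, align in zip(src_sents, tgt_sents, align_lines):
            s_toks, t_toks = src.split(), tgt.split()
            for pair in align.split():
                i, j = pair.split("-")
                counts[s_toks[int(i)]][t_toks[int(j)]] += 1
        return counts

    # Hypothetical usage:
    # counts = alignment_counts(open("train.en"), open("train.hi"), open("align.en-hi"))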

labeled data

Appears in 3 sentences as: labeled data (4)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. Popular approaches for Cross-Lingual Sentiment Analysis (CLSA) (Wan, 2009; Duh et al., 2011) depend on Machine Translation (MT) for converting the labeled data from one language to the other (Hiroshi et al., 2004; Banea et al., 2008; Wan, 2009).
    Page 2, “Introduction”
  2. In situations where labeled data is not present in a language, approaches based on cross-lingual sentiment analysis are used.
    Page 3, “Related Work”
  3. Algorithm 1 Projection based on sense
     Input: Polarity labeled data in source language (S) and data in target language (T) to be labeled
     Output: Classified documents
     1: Sense mark the polarity labeled data from S
     2: Project the sense marked corpora from S to T using a Multidict
     3: Model the sentiment classifier using the data obtained in step-2
     4: Sense mark the unlabelled data from T
     5: Test the sentiment classifier on data obtained in step-4 using model obtained in step-3
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
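
Item 3 above reproduces Algorithm 1 (projection based on sense). A hedged Python sketch of its projection step follows; modelling the Multidict as a plain dictionary from source-language synset identifiers to target-language ones is an assumption about its interface.

    # Hedged sketch of the projection step of Algorithm 1. The Multidict is
    # modelled as a dict from source-language synset IDs to target-language IDs;
    # sense marking is assumed to have been done beforehand.
    def project(sense_marked_docs, multidict):
        """Map every source synset ID to its target-language counterpart;
        IDs with no dictionary entry are dropped, mirroring a lossy dictionary."""
        return [[multidict[s] for s in doc if s in multidict]
                for doc in sense_marked_docs]

    # Hypothetical usage:
    # projected = project(source_docs, en_hi_multidict)
    # ...train a classifier on `projected`, then test it on sense-marked target documents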

parallel corpus

Appears in 3 sentences as: parallel corpus (4)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. To perform CLSA, this study leverages an unlabelled parallel corpus to generate the word alignments.
    Page 2, “Introduction”
  2. As a viable alternative, cluster linkages could be learned from a bilingual parallel corpus and these linkages can be used to bridge the language gap for CLSA.
    Page 4, “Clustering for Cross Lingual Sentiment Analysis”
  3. English-Hindi parallel corpus contains 45992 sentences and English-Marathi parallel corpus contains 47881 sentences.
    Page 5, “Experimental Setup”

sense disambiguation

Appears in 3 sentences as: sense disambiguation (3)
In The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
  1. WordNets are primarily used to address the problem of word sense disambiguation.
    Page 1, “Introduction”
  2. by using automatic/manual sense disambiguation techniques.
    Page 3, “Clustering for Sentiment Analysis”
  3. The sense disambiguation accuracy of the same would have been lower in a cross-domain setting.
    Page 7, “Discussions”
