Utterance-Level Multimodal Sentiment Analysis
Veronica Perez-Rosas, Rada Mihalcea, and Louis-Philippe Morency

Article Structure

Abstract

During real-life interactions, people naturally gesture and modulate their voice to emphasize specific points or to express their emotions.

Introduction

Video reviews represent a growing source of consumer information that has gained increasing interest from companies, researchers, and consumers.

Related Work

In this section we provide a brief overview of related work in text-based sentiment analysis, as well as audiovisual emotion analysis.

MOUD: Multimodal Opinion Utterances Dataset

For our experiments, we created a dataset of utterances (named MOUD) containing product opinions expressed in Spanish. We chose to work with Spanish because it is a widely used language, and it is the native language of the main author of this paper.

Multimodal Sentiment Analysis

The main advantage that comes with the analysis of video opinions, as compared to their textual counterparts, is the availability of visual and speech cues.

Experiments and Results

We run our sentiment classification experiments on the MOUD dataset introduced earlier.

Discussion

The experimental results show that sentiment classification can be effectively performed on multimodal datastreams.

Conclusions

In this paper, we presented a multimodal approach for utterance-level sentiment classification.

Topics

sentiment analysis

Appears in 11 sentences as: Sentiment Analysis (1) sentiment analysis (10)
In Utterance-Level Multimodal Sentiment Analysis
  1. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.
    Page 1, “Abstract”
  2. This is in line with earlier work on text-based sentiment analysis, where it has been observed that full-document reviews often contain both positive and negative comments, which led to a number of methods addressing opinion analysis at sentence level.
    Page 1, “Introduction”
  3. In our work, this dataset enabled a wide range of multimodal sentiment analysis experiments, addressing the relative importance of modalities and individual features.
    Page 1, “Introduction”
  4. The following section presents related work in text-based sentiment analysis and audiovisual emotion recognition.
    Page 1, “Introduction”
  5. In this section we provide a brief overview of related work in text-based sentiment analysis, as well as audiovisual emotion analysis.
    Page 2, “Related Work”
  6. 2.1 Text-based Subjectivity and Sentiment Analysis
    Page 2, “Related Work”
  7. The techniques developed so far for subjectivity and sentiment analysis have focused primarily on the processing of text, and consist of either rule-based classifiers that make use of opinion lexicons, or data-driven methods that assume the availability of a large dataset annotated for polarity.
    Page 2, “Related Work”
  8. One of the first lexicons used in sentiment analysis is the General Inquirer (Stone, 1968).
    Page 2, “Related Work”
  9. While difficult problems such as cross-domain (Blitzer et al., 2007; Li et al., 2012) or cross-language (Mihalcea et al., 2007; Wan, 2009; Meng et al., 2012) portability have been addressed, not much has been done in terms of extending the applicability of sentiment analysis to other modalities,
    Page 2, “Related Work”
  10. Second, previous work in subjectivity and sentiment analysis has demonstrated that a layered approach (where neutral statements are first separated from opinion statements followed by a separation between positive and negative statements) works better than a single three-way classification.
    Page 6, “Experiments and Results”
  11. Video-level sentiment analysis.
    Page 7, “Discussion”
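
Sentence 10 above refers to a layered classification scheme: neutral statements are first separated from opinionated ones, and only the opinionated statements are then labeled as positive or negative. The sketch below is a minimal illustration of that idea, using scikit-learn's LinearSVC as a stand-in for the SVM setup in the paper; the variable names (X, y) and the three-way label set are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a layered (two-stage) sentiment classifier.
# Assumptions: X is an (n_samples, n_features) array of utterance features,
# y contains labels in {"neutral", "positive", "negative"}.
import numpy as np
from sklearn.svm import LinearSVC

def train_layered(X, y):
    y = np.asarray(y)
    # Stage 1: neutral vs. opinionated (positive or negative).
    stage1 = LinearSVC().fit(X, y == "neutral")
    # Stage 2: polarity, trained only on the opinionated utterances.
    opinion = y != "neutral"
    stage2 = LinearSVC().fit(X[opinion], y[opinion])
    return stage1, stage2

def predict_layered(stage1, stage2, X):
    # Utterances flagged as neutral keep that label; the rest get a polarity.
    return np.where(stage1.predict(X), "neutral", stage2.predict(X))
```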

sentiment classification

Appears in 11 sentences as: sentiment classification (10) sentiment classifiers (1)
In Utterance-Level Multimodal Sentiment Analysis
  1. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams.
    Page 1, “Abstract”
  2. Our experiments and results on multimodal sentiment classification are presented in Section 5, with a detailed discussion and analysis in Section 6.
    Page 2, “Introduction”
  3. These simple weighted unigram features have been successfully used in the past to build sentiment classifiers on text, and in conjunction with Support Vector Machines (SVM) have been shown to lead to state-of-the-art performance (Maas et al., 2011).
    Page 5, “Multimodal Sentiment Analysis”
  4. We run our sentiment classification experiments on the MOUD dataset introduced earlier.
    Page 6, “Experiments and Results”
  5. Table 2: Utterance-level sentiment classification with linguistic, acoustic, and visual features.
    Page 6, “Experiments and Results”
  6. Table 2 shows the results of the utterance-level sentiment classification experiments.
    Page 6, “Experiments and Results”
  7. The experimental results show that sentiment classification can be effectively performed on multimodal datastreams.
    Page 6, “Discussion”
  8. Other informative features for sentiment classification are the voice probability, representing the energy in speech, the combined visual features that represent an angry face, and two of the cepstral coefficients.
    Page 7, “Discussion”
  9. To understand the role played by the size of the video-segments considered in the sentiment classification experiments, as well as the potential effect of a speaker-independence assumption, we also run a set of experiments where we use full videos for the classification.
    Page 7, “Discussion”
  10. In this paper, we presented a multimodal approach for utterance-level sentiment classification.
    Page 7, “Conclusions”
  11. Table 4: Video-level sentiment classification with linguistic, acoustic, and visual features.
    Page 8, “Conclusions”

feature vector

Appears in 4 sentences as: feature vector (2) feature vectors (2)
In Utterance-Level Multimodal Sentiment Analysis
  1. The features are averaged over all the frames in an utterance, to obtain one feature vector for each utterance.
    Page 5, “Multimodal Sentiment Analysis”
  2. In this approach, the features collected from all the multimodal streams are combined into a single feature vector, thus resulting in one vector for each utterance in the dataset, which is then used to make a decision about the sentiment orientation of the utterance.
    Page 6, “Experiments and Results”
  3. This may be due to the smaller number of feature vectors used in the experiments (only 80, as compared to the 412 used in the previous setup).
    Page 7, “Discussion”
  4. Another possible reason is that the acoustic and visual modalities are significantly weaker than the linguistic modality, most likely because the feature vectors are now speaker-independent, which makes it harder to improve over the linguistic modality alone.
    Page 7, “Discussion”
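
Sentences 1 and 2 above describe how the utterance-level vectors are built: frame-level features are averaged over each utterance, and the per-modality vectors are then combined into a single vector (feature-level fusion). Below is a minimal sketch of those two steps, assuming the frame-level descriptors are already extracted; the array shapes and the function name are illustrative, not taken from the authors' implementation.

```python
# Sketch of utterance-level feature construction with feature-level fusion.
# Assumptions: frame-level descriptors already exist; shapes are illustrative.
import numpy as np

def utterance_vector(visual_frames, acoustic_frames, unigram_counts):
    """Average each modality over its frames, then concatenate all modalities.

    visual_frames   : (n_video_frames, n_visual_features)
    acoustic_frames : (n_audio_frames, n_acoustic_features)
    unigram_counts  : (n_vocabulary,) bag-of-words vector for the transcription
    """
    visual = np.asarray(visual_frames).mean(axis=0)
    acoustic = np.asarray(acoustic_frames).mean(axis=0)
    # Feature-level ("early") fusion: one vector per utterance.
    return np.concatenate([np.asarray(unigram_counts), acoustic, visual])
```

Averaging over frames discards the temporal dynamics within an utterance, but it yields a fixed-length vector per modality that can be concatenated directly with the bag-of-words features.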

error rate

Appears in 3 sentences as: error rate (3)
In Utterance-Level Multimodal Sentiment Analysis
  1. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.
    Page 1, “Abstract”
  2. Compared to the best individual classifier, the relative error rate reduction obtained with the trimodal classifier is 10.5%.
    Page 7, “Discussion”
  3. Our experiments show that sentiment annotation of utterance-level visual datastreams can be effectively performed, and that the use of multiple modalities can lead to error rate reductions of up to 10.5% as compared to the use of one modality at a time.
    Page 8, “Conclusions”
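
The 10.5% figure is a relative error rate reduction: the drop in classification error is expressed as a fraction of the error of the best individual modality, not as an absolute accuracy difference. The snippet below shows the computation with made-up accuracies; the numbers are hypothetical and not taken from the paper's result tables.

```python
# Relative error rate reduction of a multimodal classifier over the best
# single-modality classifier. The accuracy values below are hypothetical.
best_single_acc = 0.70   # best individual modality (hypothetical)
multimodal_acc = 0.73    # trimodal classifier (hypothetical)

err_single = 1.0 - best_single_acc
err_multi = 1.0 - multimodal_acc
relative_reduction = (err_single - err_multi) / err_single
print(f"relative error rate reduction: {relative_reduction:.1%}")  # 10.0%
```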

feature weights

Appears in 3 sentences as: Feature Weights (1) feature weights (2)
In Utterance-Level Multimodal Sentiment Analysis
  1. Feature Weights
    Page 7, “Discussion”
  2. Figure 2: Visual and acoustic feature weights.
    Page 7, “Discussion”
  3. To determine the role played by each of the visual and acoustic features, we compare the feature weights assigned by the learning algorithm, as shown in Figure 2.
    Page 7, “Discussion”
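
With a linear classifier such as a linear-kernel SVM, the feature weights mentioned in sentence 3 can be read directly from the trained model. The sketch below shows one way to rank visual and acoustic features by the magnitude of their learned weights; it assumes a scikit-learn linear SVM (the paper used the Weka toolkit) and hypothetical inputs X, y, and feature_names.

```python
# Rank features by the weight a linear SVM assigns to them.
# Assumptions: X, y are utterance-level features and labels, and
# feature_names lists the columns of X; all names are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def ranked_feature_weights(X, y, feature_names):
    model = LinearSVC().fit(X, y)
    weights = model.coef_.ravel()               # one weight per feature (binary task)
    order = np.argsort(np.abs(weights))[::-1]   # largest magnitude first
    return [(feature_names[i], weights[i]) for i in order]
```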

SVM

Appears in 3 sentences as: SVM (3)
In Utterance-Level Multimodal Sentiment Analysis
  1. These simple weighted unigram features have been successfully used in the past to build sentiment classifiers on text, and in conjunction with Support Vector Machines (SVM) have been shown to lead to state-of-the-art performance (Maas et al., 2011).
    Page 5, “Multimodal Sentiment Analysis”
  2. We use the entire set of 412 utterances and run tenfold cross-validation using an SVM classifier, as implemented in the Weka toolkit. In line with previous work on emotion recognition in speech (Haq and Jackson, 2009; Anagnostopoulos and Vovoli, 2010), where utterances are selected in a speaker-dependent manner (i.e., utterances from the same speaker are included in both training and test), as well as work on sentence-level opinion classification, where document boundaries are not considered in the split performed between the training and test sets (Wilson et al., 2004; Wiegand and Klakow, 2009), the training/test split for each fold is performed at utterance level regardless of the video the utterances belong to.
    Page 6, “Experiments and Results”
  3. As before, the linguistic, acoustic, and visual features are averaged over the entire video, and we use an SVM classifier in tenfold cross-validation experiments.
    Page 7, “Discussion”
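
Sentence 2 describes tenfold cross-validation in which the train/test split is made at the utterance level, so utterances from the same speaker and video can appear in both training and test folds, whereas the video-level experiments in sentence 3 are effectively speaker-independent. The sketch below contrasts the two splitting strategies on synthetic data; scikit-learn stands in for the Weka SVM used in the paper, and the feature and group arrays are invented.

```python
# Two cross-validation setups for utterance-level sentiment experiments.
# Synthetic data stands in for the real utterance features.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(412, 40))             # 412 utterances, 40 features (synthetic)
y = rng.integers(0, 2, size=412)           # positive/negative labels (synthetic)
video_ids = rng.integers(0, 80, size=412)  # which of 80 videos each utterance is from

# Utterance-level (speaker-dependent) split: utterances are shuffled freely,
# so the same video can contribute to both training and test folds.
dep_scores = cross_val_score(SVC(), X, y,
                             cv=KFold(n_splits=10, shuffle=True, random_state=0))

# Speaker-independent split: all utterances of a video stay in one fold.
indep_scores = cross_val_score(SVC(), X, y,
                               cv=GroupKFold(n_splits=10), groups=video_ids)
```

GroupKFold keeps every utterance of a video inside a single fold, which approximates the stricter speaker-independent condition of the video-level experiments discussed above.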

unigram

Appears in 3 sentences as: unigram (4)
In Utterance-Level Multimodal Sentiment Analysis
  1. We use a bag-of-words representation of the video transcriptions of each utterance to derive unigram counts, which are then used as linguistic features.
    Page 5, “Multimodal Sentiment Analysis”
  2. The remaining words represent the unigram features, which are then associated with a value corresponding to the frequency of the unigram inside each utterance transcription.
    Page 5, “Multimodal Sentiment Analysis”
  3. These simple weighted unigram features have been successfully used in the past to build sentiment classifiers on text, and in conjunction with Support Vector Machines (SVM) have been shown to lead to state-of-the-art performance (Maas et al., 2011).
    Page 5, “Multimodal Sentiment Analysis”
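
Sentences 1 and 2 describe the linguistic features: a bag-of-words representation of each utterance transcription, with the remaining words used as unigram features weighted by their frequency within the utterance. Below is a minimal sketch with scikit-learn's CountVectorizer; the example transcriptions are invented, and the paper's exact preprocessing (which words are filtered out, etc.) is not reproduced here.

```python
# Bag-of-words unigram counts as linguistic features, one row per utterance.
# The transcriptions below are invented examples, not MOUD data.
from sklearn.feature_extraction.text import CountVectorizer

transcriptions = [
    "I really like this camera",
    "the battery is terrible and the price is too high",
]

vectorizer = CountVectorizer()             # unigram term-frequency counts
X_linguistic = vectorizer.fit_transform(transcriptions)

print(vectorizer.get_feature_names_out())  # the unigram vocabulary
print(X_linguistic.toarray())              # frequency of each unigram per utterance
```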
