Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
Martineau, Justin and Chen, Lu and Cheng, Doreen and Sheth, Amit

Article Structure

Abstract

Many machine learning datasets are noisy with a substantial number of mislabeled instances.

Introduction

Supervised classification algorithms require annotated data to teach the machine, by example, how to perform a specific task.

Related Work

Research on handling noisy datasets with mislabeled instances has focused on three major groups of techniques: (1) noise tolerance, (2) noise elimination, and (3) noise correction.

An Active Learning Framework for Label Correction

Let D = {(x_1, y_1), ..., (x_n, y_n)} be a dataset of binary labeled instances, where x_i denotes the instance and y_i its binary label.

Feature Weighting Methods

Building the classifier C that allows the most likely mislabeled instances to be selected and annotated is the essence of the active learning approach.

Experiments

We conduct experiments on a Twitter dataset that contains tweets about TV shows and movies.

Conclusion

In this paper, we explored an active learning approach to improve data annotation quality for classification tasks.

Topics

ground truth

Appears in 9 sentences as: Ground Truth (1) Ground truth (1) ground truth (7)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. Table 2: Ground truth annotation
    Page 6, “Experiments”
  2. After that, the same dataset was annotated independently by a group of expert annotators to create the ground truth.
    Page 6, “Experiments”
  3. We first describe the AMT annotation and ground truth annotation, and then discuss the baselines and experimental results.
    Page 6, “Experiments”
  4. Ground Truth Annotation: After we obtained the annotated dataset from AMT, we posted the same dataset (without the labels) to a group of expert annotators.
    Page 6, “Experiments”
  5. We used this annotated dataset as ground truth.
    Page 6, “Experiments”
  6. See Table 2 for the statistics of the ground truth annotations.
    Page 6, “Experiments”
  7. Compared with the ground truth, many emotion-bearing tweets were missed by the AMT annotators, despite the quality control we applied.
    Page 6, “Experiments”
  8. For experimental purposes, the re-annotation was done by assigning the ground truth labels to the selected instances.
    Page 7, “Experiments”
  9. We compared the improved dataset with the final ground truth at the end of each round to monitor the progress.
    Page 8, “Experiments”

F1 Score

Appears in 8 sentences as: F1 Score (7) F1 score (2)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. Evaluation Metric: We evaluated the results with both Mean Average Precision (MAP) and F1 Score.
    Page 7, “Experiments”
  2. Macro-averaged F1 Score (y-axis label of Figure 1b).
    Page 7, “Experiments”
  3. (b) Macro-Averaged F1 Score
    Page 7, “Experiments”
  4. We reported both the macro-averaged MAP (Figure 1a) and the macro-averaged F1 Score (Figure 1b) on eight emotions as the overall performance of three competitive methods: Spread, SVM-Delta-IDF and SVM-TF.
    Page 8, “Experiments”
  5. In comparison, SVM-Delta-IDF significantly outperforms SVM-TF with respect to both MAP and F1 Score.
    Page 8, “Experiments”
  6. SVM-TF achieves higher MAP and F1 Score than Spread at the first few iterations, but it is beaten by Spread once 16,500 tweets have been selected and re-annotated (by the eighth iteration).
    Page 8, “Experiments”
  7. Overall, at the end of the active learning process, Spread outperforms SVM-TF by 3.03% in MAP score (and by 4.29% in F1 score), and SVM-Delta-IDF outperforms SVM-TF by 8.59% in MAP score (and by 5.26% in F1 score).
    Page 8, “Experiments”
  8. Spread achieves an F1 Score of 58.84%, which is quite competitive with the 59.82% achieved by SVM-Delta-IDF, though SVM-Delta-IDF outperforms Spread with respect to MAP.
    Page 8, “Experiments”
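
The excerpts above name the two evaluation metrics, Mean Average Precision (MAP) and macro-averaged F1 Score, both reported over the eight emotion categories. As a rough illustration rather than the paper's evaluation code, the Python sketch below computes both with scikit-learn, assuming per-emotion binary ground-truth labels plus ranking scores (for MAP) and predicted labels (for F1):

import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def macro_map(true_by_emotion, scores_by_emotion):
    """Mean Average Precision: average precision per emotion (computed from
    the ranking scores), macro-averaged over the emotion categories."""
    aps = [average_precision_score(y, s)
           for y, s in zip(true_by_emotion, scores_by_emotion)]
    return float(np.mean(aps))

def macro_f1(true_by_emotion, pred_by_emotion):
    """Macro-averaged F1: binary F1 per emotion, averaged over the emotions."""
    f1s = [f1_score(y, p) for y, p in zip(true_by_emotion, pred_by_emotion)]
    return float(np.mean(f1s))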

feature weighting

Appears in 6 sentences as: feature weighting (4) feature weights (3)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. We describe computationally cheap feature weighting techniques and a novel nonlinear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances to significantly improve annotation quality at low cost.
    Page 1, “Abstract”
  2. Following this idea, we develop computationally cheap feature weighting techniques to counteract this effect by boosting the weight of discriminative features, so that they are not subdued and the instances containing such features have a higher chance of being correctly classified.
    Page 4, “Feature Weighting Methods”
  3. Specifically, we propose a nonlinear distribution spreading algorithm for feature weighting.
    Page 4, “Feature Weighting Methods”
  4. The model’s ability to discriminate at the feature level can be further enhanced by leveraging the distribution of feature weights across multiple classes, e.g., multiple emotion categories such as funny, happy, sad, exciting, boring, etc.
    Page 5, “Feature Weighting Methods”
  5. While these feature weighting models can be used to score and rank instances for data cleaning, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.
    Page 5, “Feature Weighting Methods”
  6. Spreading the feature weights reduces the number of data points that must be examined in order to correct the mislabeled instances.
    Page 8, “Experiments”
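
The nonlinear distribution spreading weights themselves (Formula 3) are not reproduced in these excerpts, so the Python sketch below is only an illustration of the idea described above, boosting a feature's weight for an emotion class when its Delta IDF weight for that class stands out from its weights for the other classes, with an exponent supplying the nonlinearity; it is not the paper's formula.

import numpy as np

def spread_weights(delta_by_class, exponent=2.0):
    """delta_by_class: array of shape (n_classes, n_terms), one Delta IDF
    weight vector per emotion class (funny, happy, sad, ...). Returns
    boosted weights of the same shape. Illustrative only; the paper's
    Formula 3 may differ."""
    mean_w = delta_by_class.mean(axis=0, keepdims=True)  # per-term mean across classes
    deviation = np.abs(delta_by_class - mean_w)          # how far this class stands out
    boost = 1.0 + deviation ** exponent                  # nonlinear spreading factor
    return delta_by_class * boost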

Amazon Mechanical Turk

Appears in 6 sentences as: Amazon Mechanical Turk (4) Amazon’s Mechanical Turk (2)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. In this paper we study a large, low quality annotated dataset, created quickly and cheaply using Amazon Mechanical Turk to crowd-source annotations.
    Page 1, “Abstract”
  2. There are generally two ways to collect annotations of a dataset: through a few expert annotators, or through crowdsourcing services (e.g., Amazon’s Mechanical Turk).
    Page 1, “Introduction”
  3. We employ Amazon’s Mechanical Turk (AMT) to label the emotions of Twitter data, and apply the proposed methods to the AMT dataset with the goals of improving the annotation quality at low cost, as well as learning accurate emotion classifiers.
    Page 2, “Introduction”
  4. We then sent these tweets to Amazon Mechanical Turk for annotation.
    Page 5, “Experiments”
  5. In order to evaluate our approach in real world scenarios, instead of creating a high quality annotated dataset and then introducing artificial noise, we followed the common practice of crowdsourcing, and collected emotion annotations through Amazon Mechanical Turk (AMT).
    Page 6, “Experiments”
  6. Amazon Mechanical Turk Annotation: we posted the set of 100K tweets to the workers on AMT for emotion annotation.
    Page 6, “Experiments”

Mechanical Turk

Appears in 6 sentences as: Mechanical Turk (6)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. In this paper we study a large, low quality annotated dataset, created quickly and cheaply using Amazon Mechanical Turk to crowd-source annotations.
    Page 1, “Abstract”
  2. There are generally two ways to collect annotations of a dataset: through a few expert annotators, or through crowdsourcing services (e.g., Amazon’s Mechanical Turk).
    Page 1, “Introduction”
  3. We employ Amazon’s Mechanical Turk (AMT) to label the emotions of Twitter data, and apply the proposed methods to the AMT dataset with the goals of improving the annotation quality at low cost, as well as learning accurate emotion classifiers.
    Page 2, “Introduction”
  4. We then sent these tweets to Amazon Mechanical Turk for annotation.
    Page 5, “Experiments”
  5. In order to evaluate our approach in real world scenarios, instead of creating a high quality annotated dataset and then introducing artificial noise, we followed the common practice of crowdsourcing, and collected emotion annotations through Amazon Mechanical Turk (AMT).
    Page 6, “Experiments”
  6. Amazon Mechanical Turk Annotation: we posted the set of 100K tweets to the workers on AMT for emotion annotation.
    Page 6, “Experiments”

SVM

Appears in 6 sentences as: SVM (7)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. (2012) propose an algorithm which first trains individual SVM classifiers on several small, class-balanced, random subsets of the dataset, and then reclassifies each training instance using a majority vote of these individual classifiers.
    Page 3, “Related Work”
  2. Methods: We evaluated the overall performance relative to the common SVM bag of words approach that can be ubiquitously found in text mining literature.
    Page 7, “Experiments”
  3. SVM-TF: Uses a bag of words SVM with term frequency weights.
    Page 7, “Experiments”
  4. SVM-Delta-IDF: Uses a bag of words SVM classification with TF.Delta-IDF weights (Formula 2) in the feature vectors before training or testing an SVM.
    Page 7, “Experiments”
  5. We built the SVM classifiers using LIBLINEAR (Fan et al., 2008) and applied its L2-regularized support vector regression model.
    Page 7, “Experiments”
  6. Based on the dot product or SVM regression scores, we ranked the tweets by how strongly they express the emotion.
    Page 7, “Experiments”
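
To make the baselines above concrete, here is a hedged Python sketch of SVM-Delta-IDF: scikit-learn's LinearSVR (a wrapper around LIBLINEAR's L2-regularized support vector regression) stands in for the paper's direct use of LIBLINEAR, and an add-one smoothed Delta IDF is assumed because Formulas 1 and 2 are not reproduced in these excerpts. Term-frequency vectors are reweighted by the Delta IDF weights before training, and tweets are ranked by the regression scores; dropping the reweighting step gives the SVM-TF baseline.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

def fit_svm_delta_idf(texts, y):
    """texts: tweets; y: noisy binary labels (1 = expresses the emotion)."""
    vec = CountVectorizer()
    X_tf = vec.fit_transform(texts)                               # bag-of-words term frequencies
    y = np.asarray(y)
    df_pos = np.asarray((X_tf[y == 1] > 0).sum(axis=0)).ravel()   # document frequency in positive tweets
    df_neg = np.asarray((X_tf[y == 0] > 0).sum(axis=0)).ravel()   # document frequency in negative tweets
    n_pos, n_neg = int((y == 1).sum()), int((y == 0).sum())
    delta = (np.log2((n_neg + 1) / (df_neg + 1))
             - np.log2((n_pos + 1) / (df_pos + 1)))               # assumed smoothed Delta IDF weights
    X = X_tf.multiply(delta).tocsr()                              # TF x Delta-IDF feature vectors
    model = LinearSVR(C=1.0).fit(X, y)                            # L2-regularized linear SVR
    return vec, delta, model

def rank_by_emotion_strength(vec, delta, model, texts):
    X = vec.transform(texts).multiply(delta).tocsr()
    return np.argsort(-model.predict(X))                          # strongest-emotion tweets first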

weight vector

Appears in 6 sentences as: weight vector (6)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. We calculate the Delta IDF score of every term in V, and get the Delta IDF weight vector Δ = (Δidf_1, ..., Δidf_|V|) for all terms.
    Page 4, “Feature Weighting Methods”
  2. When the dataset is imbalanced, to avoid building a biased model, we down-sample the majority class before calculating the Delta IDF score and then use a bias balancing procedure to balance the Delta IDF weight vector.
    Page 4, “Feature Weighting Methods”
  3. This procedure first divides the Delta IDF weight vector into two vectors, one of which contains all the features with positive scores, and the other of which contains all the features with negative scores.
    Page 4, “Feature Weighting Methods”
  4. Let V_l be the vocabulary of dataset D_l, V be the vocabulary of all datasets, and |V| be the number of unique terms in V. Using Formula (1) and dataset D_l, we get the Delta IDF weight vector for each class l: Δ^l = (Δidf^l_1, ..., Δidf^l_|V|).
    Page 5, “Feature Weighting Methods”
  5. Delta-IDF: Takes the dot product of the Delta IDF weight vector (Formula 1) with the document’s term frequency vector.
    Page 7, “Experiments”
  6. Spread: Takes the dot product of the distribution spread weight vector (Formula 3) with the document’s term frequency vector.
    Page 7, “Experiments”
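
Read together, the excerpts above say that a Delta IDF score is computed for every term, collected into the weight vector, and that a document is then scored by the dot product of that vector with the document's term-frequency vector (the Delta-IDF baseline). The minimal Python sketch below assumes an add-one smoothed form of Delta IDF, since Formula 1, the down-sampling step, and the bias balancing procedure are not reproduced in these excerpts:

import numpy as np
from collections import Counter

def delta_idf_vector(pos_docs, neg_docs, vocab):
    """One assumed form of Delta IDF: IDF on the negative documents minus
    IDF on the positive documents (add-one smoothed), so terms indicative
    of the positive class get positive weights. Docs are token lists."""
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    df_pos = Counter(t for doc in pos_docs for t in set(doc))
    df_neg = Counter(t for doc in neg_docs for t in set(doc))
    return np.array([np.log2((n_neg + 1) / (df_neg[t] + 1))
                     - np.log2((n_pos + 1) / (df_pos[t] + 1)) for t in vocab])

def delta_idf_score(doc, vocab, weights):
    """Dot product of the Delta IDF weight vector with the document's
    term-frequency vector (the Delta-IDF baseline described above)."""
    tf = Counter(doc)
    return float(sum(tf[t] * w for t, w in zip(vocab, weights)))

Per the fourth excerpt, running the same computation on each class's dataset D_l yields the per-class weight vectors that the distribution spreading method builds on.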

iteratively

Appears in 5 sentences as: iteratively (5)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. We describe computationally cheap feature weighting techniques and a novel nonlinear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances to significantly improve annotation quality at low cost.
    Page 1, “Abstract”
  2. The process of selecting and relabeling data points can be conducted over multiple rounds to iteratively improve the data quality.
    Page 1, “Introduction”
  3. An active learner uses a small set of labeled data to iteratively select the most informative instances from a large pool of unlabeled data for human annotators to label (Settles, 2010).
    Page 1, “Introduction”
  4. In this work, we borrow the idea of active learning to interactively and iteratively correct labeling errors.
    Page 1, “Introduction”
  5. (2012) propose a solution called Active Label Correction (ALC) which iteratively presents the experts with small sets of suspected mislabeled instances at each round.
    Page 3, “Related Work”
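
The excerpts above outline the loop: in each round a classifier trained on the current (noisy) labels flags the instances whose labels it disagrees with most strongly, those instances are re-annotated, and the corrected labels feed the next round. The Python sketch below is a schematic of that loop rather than the paper's implementation; the scoring function, the per-round budget, and the re-annotation callback (expert annotators, or simply the ground-truth labels in the paper's simulated experiments) are placeholders.

import numpy as np

def correction_round(fit_and_score, texts, labels, budget, reannotate):
    """fit_and_score(texts, labels) -> one score per instance, higher meaning
    'more positive'; reannotate(indices) -> corrected labels for those indices."""
    labels = np.asarray(labels)
    scores = np.asarray(fit_and_score(texts, labels))
    # Suspicion: positively labeled instances with low scores, and negatively
    # labeled instances with high scores, look most likely to be mislabeled.
    suspicion = np.where(labels == 1, -scores, scores)
    picked = np.argsort(-suspicion)[:budget]          # most suspicious first
    corrected = labels.copy()
    corrected[picked] = reannotate(picked)
    return corrected, picked

def active_label_correction(fit_and_score, texts, labels, reannotate,
                            rounds=8, budget_per_round=2000):
    """Run several correction rounds, iteratively improving the labels."""
    for _ in range(rounds):
        labels, _ = correction_round(fit_and_score, texts, labels,
                                     budget_per_round, reannotate)
    return labels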

labeled data

Appears in 5 sentences as: labeled data (5)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. An active learner uses a small set of labeled data to iteratively select the most informative instances from a large pool of unlabeled data for human annotators to label (Settles, 2010).
    Page 1, “Introduction”
  2. In Active Learning (Settles, 2010) a small set of labeled data is used to find documents that should be annotated from a large pool of unlabeled documents.
    Page 3, “Related Work”
  3. Due to these reasons, there is a lack of sufficient and high quality labeled data for emotion research.
    Page 6, “Experiments”
  4. Since in real world applications people are primarily concerned with how well the algorithm will work for new TV shows or movies that may not be included in the training data, we defined a test fold for each TV show or movie in our labeled data set.
    Page 7, “Experiments”
  5. Each test fold corresponded to a training fold containing all the labeled data from all the other TV shows and movies.
    Page 7, “Experiments”
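
The last two excerpts describe the fold design: each TV show or movie is held out once as a test fold, and the corresponding training fold contains all the labeled data from the remaining shows and movies. A minimal sketch of that split using scikit-learn's LeaveOneGroupOut, with illustrative variable names:

from sklearn.model_selection import LeaveOneGroupOut

def per_show_folds(tweets, labels, shows):
    """shows[i] names the TV show or movie that tweets[i] is about; each
    unique show becomes one test fold (train on all the other shows)."""
    splitter = LeaveOneGroupOut()
    for train_idx, test_idx in splitter.split(tweets, labels, groups=shows):
        yield train_idx, test_idx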

learning algorithms

Appears in 4 sentences as: learning algorithm (1) learning algorithms (3)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. Noise tolerance techniques aim to improve the learning algorithm itself to avoid over-fitting caused by mislabeled instances in the training phase, so that the constructed classifier becomes more noise-tolerant.
    Page 2, “Related Work”
  2. Decision tree (Mingers, 1989; Vannoorenberghe and Denoeux, 2002) and boosting (Jiang, 2001; Kalai and Servedio, 2005; Karmaker and Kwek, 2006) are two learning algorithms that have been investigated in many studies.
    Page 2, “Related Work”
  3. For example, useful information can be removed with noise elimination, since annotation errors are likely to occur on ambiguous instances that are potentially valuable for learning algorithms.
    Page 2, “Related Work”
  4. While these feature weighting models can be used to score and rank instances for data cleaning, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.
    Page 5, “Feature Weighting Methods”

machine learning

Appears in 3 sentences as: machine learning (3)
In Active Learning with Efficient Feature Weighting Methods for Improving Data Quality and Classification Accuracy
  1. Many machine learning datasets are noisy with a substantial number of mislabeled instances.
    Page 1, “Abstract”
  2. It uses the difference between the low quality label for each data point and a prediction of the label using supervised machine learning models built upon the low quality labels.
    Page 3, “Related Work”
  3. While these feature weighting models can be used to score and rank instances for data cleaning, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.
    Page 5, “Feature Weighting Methods”
