Local Histograms of Character N-grams for Authorship Attribution
Escalante, Hugo Jair and Solorio, Thamar and Montes-y-Gomez, Manuel

Article Structure

Abstract

This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA).

Introduction

Authorship attribution (AA) is the task of deciding who, from a set of candidates, is the author of a given document (Houvardas and Stamatatos, 2006; Luyckx and Daelemans, 2010; Stamatatos, 2009b).

Related Work

AA can be faced as a multiclass classification task with as many classes as candidate authors.

Background

This section describes preliminary information on document representations and pattern classification.

Authorship Attribution With LOWBOW Representations

For AA we represent the training documents of each author using the framework described in Section 3.2, thus each document of each candidate author is either a LOWBOW histogram or a bag of local histograms (BOLH).
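
To make this representation concrete, the following is a minimal Python sketch of how a bag of local histograms could be computed for a document given as an ordered sequence of term indices (e.g., indices of character n-grams). The Gaussian positional smoothing, the number of anchor points k, and the normalization are illustrative assumptions in the spirit of the LOWBOW framework (Lebanon et al., 2007), not the authors' exact implementation.

```python
import numpy as np

def bag_of_local_histograms(term_ids, vocab_size, k=20, sigma=0.1):
    """Compute a bag of k local histograms for a document given as an
    ordered sequence of term indices (e.g., character n-gram indices).
    Each local histogram weights term occurrences by a Gaussian kernel
    centered at one of k equally spaced anchor points in [0, 1]; the
    Gaussian smoother and its width are illustrative assumptions."""
    n = len(term_ids)
    positions = (np.arange(n) + 0.5) / n        # normalized term positions
    anchors = (np.arange(k) + 0.5) / k          # k anchor points in [0, 1]
    hists = np.zeros((k, vocab_size))
    for j, mu in enumerate(anchors):
        weights = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)
        weights /= weights.sum()                # each smoother sums to 1
        for w, t in zip(weights, term_ids):
            hists[j, t] += w                    # position-weighted term count
    return hists                                # each row is one local histogram

def lowbow_histogram(local_hists):
    """Collapse the bag into a single LOWBOW-style histogram by averaging
    the local histograms (one simple aggregation choice)."""
    return local_hists.mean(axis=0)
```

In the experiments reported below, both representations are compared: the single LOWBOW histogram and the full bag of local histograms (BOLH), with the latter handled through kernels that operate over sets of histograms.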

Experiments and Results

For our experiments we considered the data set used in (Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a).

Conclusions

We have described the use of local histograms (LH) over character n-grams for AA.

Topics

n-grams

Appears in 26 sentences as: n-grams (29)
In Local Histograms of Character N-grams for Authorship Attribution
  1. This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA).
    Page 1, “Abstract”
  2. In this work we explore the suitability of LHs over n-grams at the character-level for AA.
    Page 1, “Abstract”
  3. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches.
    Page 1, “Abstract”
  4. In particular, we consider local histograms over n-grams at the character-level obtained via the locally-weighted bag of words (LOWBOW) framework (Lebanon et al., 2007).
    Page 1, “Introduction”
  5. Results confirm that local histograms of character n-grams are more helpful for AA than the usual global histograms of words or character n-grams (Luyckx and Daelemans, 2010); our results are superior to those reported in related works.
    Page 2, “Introduction”
  6. We also show that local histograms over character n-grams are more helpful than local histograms over words, as originally proposed by (Lebanon et al., 2007).
    Page 2, “Introduction”
  7. • We propose the use of local histograms over character-level n-grams for AA.
    Page 2, “Introduction”
  8. Some researchers have gone a step further and have attempted to capture sequential information by using n-grams at the word-level (Peng et al., 2004) or by discovering maximal frequent word sequences (Coyotl-Morales et al., 2006).
    Page 2, “Related Work”
  9. Unfortunately, because of computational limitations, the latter methods cannot discover enough sequential information from documents (e.g., word n-grams are often restricted to n ∈ {1, 2, 3}, while full sequential information would be obtained with n ∈ {1, . . .
    Page 2, “Related Work”
  10. Stamatatos and coworkers have studied the impact of feature selection, with character n-grams, in AA (Houvardas and Stamatatos, 2006; Stamatatos, 2006a), ensemble learning with character n-grams (Stamatatos, 2006b) and novel classification techniques based
    Page 2, “Related Work”
  11. However, as with word-based features, character n-grams are unable to incorporate sequential information from documents in their original form (in terms of the positions in which the terms appear across a document).
    Page 3, “Related Work”
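
The character-level terms referenced in the sentences above can be extracted as an ordered sequence of overlapping n-grams. The short sketch below (an illustrative assumption about tokenization, not the authors' code) shows this for character 3-grams; keeping the sequence, rather than a single global count vector, preserves exactly the positional information that local histograms exploit and global histograms discard.

```python
def char_ngrams(text, n=3):
    """Return the ordered sequence of overlapping character n-grams of a
    text; the order of the n-grams carries the positional information that
    local histograms are built from."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# char_ngrams("authorship") -> ['aut', 'uth', 'tho', 'hor', 'ors', 'rsh', 'shi', 'hip']
```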

SVM

Appears in 10 sentences as: SVM (11)
In Local Histograms of Character N-grams for Authorship Attribution
  1. applied to this problem, including support vector machine ( SVM ) classifiers (Houvardas and Stamatatos, 2006) and variants thereon (Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a), neural networks (Tearle et al., 2008), Bayesian classifiers (Coyotl-Morales et al., 2006), decision tree methods (Koppel et al., 2009) and similarity based techniques (Keselj et al., 2003; Lambers and Veenman, 2009; Stamatatos, 2009b; Koppel et al., 2009).
    Page 2, “Related Work”
  2. In this work, we chose an SVM classifier as it has reported acceptable performance in AA and because it will allow us to directly compare results with previous work that has used this same classifier.
    Page 2, “Related Work”
  3. For both types of representations we consider an SVM classifier under the one-vs-all formulation for facing the AA problem.
    Page 5, “Authorship Attribution With LOWBOW Representations”
  4. We consider SVM as base classifier because this method has proved to be very effective in a large number of applications, including AA (Houvardas and Stamatatos, 2006; Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a); further, since SVMs are kernel-based methods, they allow us to use local histograms for AA by considering kernels that work over sets of histograms.
    Page 5, “Authorship Attribution With LOWBOW Representations”
  5. We build a multiclass SVM classifier by considering the pairs of patterns-outputs associated to documents-authors.
    Page 5, “Authorship Attribution With LOWBOW Representations”
  6. All our experiments use the SVM implementation provided by Canu et al.
    Page 6, “Experiments and Results”
  7. Columns show the true author for test documents and rows show the authors predicted by the SVM.
    Page 7, “Experiments and Results”
  8. The SVM with BOW representation of character n-grams achieved recognition rates of 40% and 50% for BL and JM respectively.
    Page 7, “Experiments and Results”
  9. For these experiments we compare the performance of the BOW, LOWBOW histogram and BOLH representations; for the latter, we considered the best setting as reported in Table 3 (i.e., an SVM with diffusion kernel and k = 20).
    Page 8, “Experiments and Results”
  10. An SVM under the BOLH representation is less sensitive to the number of training examples available and to the imbalance of data than an SVM using the BOW representation.
    Page 8, “Experiments and Results”
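
Sentences 3, 4 and 9 above combine a one-vs-all SVM with kernels that operate over sets of local histograms, the best-performing setting being a diffusion kernel with k = 20. A plausible sketch of such a set kernel is given below; the specific diffusion-kernel form (following diffusion kernels over the multinomial simplex) and the pairing of local histograms by anchor position are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def diffusion_kernel(h1, h2, t=0.5):
    """Diffusion-style kernel between two histograms on the simplex:
    exp(-arccos(sum_i sqrt(h1_i * h2_i))**2 / t). The exact form and the
    bandwidth t are assumptions, not necessarily the paper's parameters."""
    bc = np.clip(np.sum(np.sqrt(h1 * h2)), 0.0, 1.0)   # Bhattacharyya coefficient
    return np.exp(-np.arccos(bc) ** 2 / t)

def bolh_kernel(H1, H2, t=0.5):
    """Kernel between two bags of local histograms (k x vocabulary arrays),
    pairing local histograms by anchor position and averaging the base
    kernel values."""
    return float(np.mean([diffusion_kernel(h1, h2, t) for h1, h2 in zip(H1, H2)]))
```

A kernel matrix precomputed this way can be plugged into any standard SVM implementation, with one one-vs-all binary problem per candidate author.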

word-level

Appears in 7 sentences as: word-level (7)
In Local Histograms of Character N-grams for Authorship Attribution
  1. Also, we empirically show that local histograms at the character-level are more helpful than local histograms at the word-level for AA.
    Page 2, “Introduction”
  2. Some researchers have gone a step further and have attempted to capture sequential information by using n-grams at the word-level (Peng et al., 2004) or by discovering maximal frequent word sequences (Coyotl-Morales et al., 2006).
    Page 2, “Related Work”
  3. One should note that, in general, better performance was obtained when using character-level rather than word-level information.
    Page 7, “Experiments and Results”
  4. This confirms the results already reported by other researchers that have used character-level and word-level information for AA (Houvardas and Stamatatos, 2006;
    Page 7, “Experiments and Results”
  5. Also, n-gram information is more dense in documents than word-level information.
    Page 7, “Experiments and Results”
  6. The improvements of local histograms over the BOW formulation vary across different settings and when using information at word-level and character-level.
    Page 8, “Experiments and Results”
  7. Results using character-level information are, in general, significantly better than those obtained with word-level information; hence, improvements are expected to be smaller.
    Page 8, “Experiments and Results”

n-gram

Appears in 5 sentences as: n-gram (5)
In Local Histograms of Character N-grams for Authorship Attribution
  1. (2003) propose the use of language models at the n-gram character-level for AA, whereas Keselj et al.
    Page 2, “Related Work”
  2. on characters at the n-gram level (Plakias and Stamatatos, 2008a).
    Page 3, “Related Work”
  3. Acceptable performance in AA has been reported with character n-gram representations.
    Page 3, “Related Work”
  4. For our character n-gram experiments, we obtained LOWBOW representations for character 3-grams (only n-grams of size n = 3 were used) considering the 2,500 most common n-grams.
    Page 6, “Experiments and Results”
  5. Also, n-gram information is more dense in documents than word-level information.
    Page 7, “Experiments and Results”
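
The vocabulary in sentence 4 is limited to the 2,500 most common character 3-grams. A generic way to build such a vocabulary from the training documents (a sketch under that assumption, not the authors' code) is:

```python
from collections import Counter

def top_ngram_vocabulary(train_texts, n=3, size=2500):
    """Map the `size` most frequent character n-grams in the training
    documents to integer indices; all other n-grams are simply ignored."""
    counts = Counter()
    for text in train_texts:
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: idx for idx, (gram, _) in enumerate(counts.most_common(size))}
```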

Support vector

Appears in 5 sentences as: Support vector (2) support vector (2) support vectors (1)
In Local Histograms of Character N-grams for Authorship Attribution
  1. • We study several kernels for a support vector machine AA classifier under the local histograms formulation.
    Page 2, “Introduction”
  2. applied to this problem, including support vector machine (SVM) classifiers (Houvardas and Stamatatos, 2006) and variants thereon (Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a), neural networks (Tearle et al., 2008), Bayesian classifiers (Coyotl-Morales et al., 2006), decision tree methods (Koppel et al., 2009) and similarity based techniques (Keselj et al., 2003; Lambers and Veenman, 2009; Stamatatos, 2009b; Koppel et al., 2009).
    Page 2, “Related Work”
  3. 3.3 Support vector machines
    Page 4, “Background”
  4. Support vector machines (SVMs) are pattern classification methods that aim to find an optimal separating hyperplane between examples from two different classes (Shawe-Taylor and Cristianini, 2004).
    Page 4, “Background”
  5. that is, a linear function over (a subset of) training examples, where α_i is the weight associated with training example i (those for which α_i > 0 are the so-called support vectors), y_i is the label associated with training example i, K(x_i, x_j) is a kernel function that aims at mapping the input vectors (x_i, x_j) into the so-called feature space, and b is a bias term.
    Page 4, “Background”
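
Written out, the decision function paraphrased in sentence 5 takes the standard SVM dual form (a reconstruction from the description above, where N is the number of training examples):

f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b

with the training examples x_i for which α_i > 0 being the support vectors.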

state of the art

Appears in 4 sentences as: state of the art (4)
In Local Histograms of Character N-grams for Authorship Attribution
  1. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches.
    Page 1, “Abstract”
  2. • We report experimental results that are superior to state of the art approaches (Plakias and Stamatatos, 2008b; Plakias and Stamatatos, 2008a), with improvements ranging from 2% to 6% in balanced data sets and from 14% to 30% in imbalanced data sets.
    Page 2, “Introduction”
  3. The BOLH formulation outperforms state of the art approaches by a considerable margin that ranges from 10% to 27%.
    Page 8, “Experiments and Results”
  4. Our experimental results showed that LHs outperform traditional bag-of-words formulations and state of the art techniques in balanced, imbalanced, and reduced data sets.
    Page 9, “Conclusions”
