Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
Ogura, Yukari and Kobayashi, Ichiro

Article Structure

Abstract

In this paper, we propose a method to improve the accuracy of text classification based on latent topics, reconsidering the techniques necessary for good classification; for example, to select important sentences in a document, sentences containing important words are usually regarded as important.

Introduction

Text classification is an essential issue in the field of natural language processing, and many techniques using latent topics have so far been proposed and used for many purposes.

Related studies

Many studies have proposed methods to improve the accuracy of text classification.

Techniques for text classification

3.1 Extraction of important words

Experiment

We evaluate our proposed method by comparing the accuracy of document clustering between our method and the method using tf.idf.

Conclusions

In this study, we have proposed a method of text clustering based on latent topics of important sentences in a document.

Topics

PageRank

Appears in 31 sentences as: PageRank (34)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. On the other hand, we apply the PageRank algorithm to rank important words in each document.
    Page 1, “Abstract”
  2. In contrast, we apply the PageRank algorithm (Brin et al., 1998) to this issue, because the algorithm scores the centrality of a node in a graph, and important words can be regarded as having centrality (Hassan et al., 2007).
    Page 1, “Introduction”
  3. For text classification, there are many studies which use the PageRank algorithm.
    Page 2, “Related studies”
  4. They apply topic-specific PageRank to a graph of both words and documents, and introduce Polarity PageRank, a new semi-supervised sentiment classifier that integrates lexicon induction with document classification.
    Page 2, “Related studies”
  5. As a study related to topic detection using important words obtained by the PageRank algorithm, Kubek et al.
    Page 2, “Related studies”
  6. (2011) detected topics in a document by constructing a graph of word co-occurrence and applying the PageRank algorithm to it.
    Page 2, “Related studies”
  7. (2004b; 2004a) have proposed multi-document summarization methods using the PageRank algorithm, called LexRank and TextRank, respectively.
    Page 2, “Related studies”
  8. They use PageRank scores to extract sentences which have centrality among other sentences for generating a summary from multiple documents.
    Page 2, “Related studies”
  9. The graph used in our method is constructed based on word co-occurrence so that important words which are sensitive to latent information can be extracted by the PageRank algorithm.
    Page 2, “Related studies”
  10. In particular, Hassan et al. (2007) show that the PageRank score ranks important words more clearly than tf.idf.
    Page 2, “Techniques for text classification”
  11. In this study, we refer to their method and use the PageRank algorithm to decide important words.
    Page 2, “Techniques for text classification”
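
The sentences above describe scoring words by their centrality in a word co-occurrence graph. Below is a minimal sketch of that idea in Python, assuming networkx, pre-tokenized co-occurrence windows, and the common damping factor of 0.85; none of these implementation details are given in the paper.

```python
# Sketch: rank important words by PageRank over a word co-occurrence graph.
# Assumptions: undirected, unweighted edges; damping factor 0.85.
from itertools import combinations

import networkx as nx

def rank_words(cooccurrence_windows, top_k=10):
    """cooccurrence_windows: iterable of token lists, one per window."""
    graph = nx.Graph()
    for window in cooccurrence_windows:
        # Link every pair of distinct words that co-occur in a window.
        graph.add_edges_from(combinations(set(window), 2))
    # PageRank scores the centrality of each word node in the graph.
    scores = nx.pagerank(graph, alpha=0.85)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```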


text classification

Appears in 12 sentences as: Text classification (1) text classification (11)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. In this paper, we propose a method to improve the accuracy of text classification based on latent topics, reconsidering the techniques necessary for good classification; for example, to select important sentences in a document, sentences containing important words are usually regarded as important.
    Page 1, “Abstract”
  2. Text classification is an essential issue in the field of natural language processing, and many techniques using latent topics have so far been proposed and used for many purposes.
    Page 1, “Introduction”
  3. In this paper, we aim to raise the accuracy of text classification using latent information by reconsidering the elemental techniques necessary for good classification in the following three points: 1) important word extraction
    Page 1, “Introduction”
  4. Since deciding important words in documents is a crucial issue for text classification, tf.idf is often used to decide them.
    Page 1, “Introduction”
  5. So, we construct a graph from the viewpoint of text classification based on latent topics.
    Page 1, “Introduction”
  6. We conduct text classification experiments on the Reuters-21578 corpus, evaluate the results of our method against those of various other classification settings, and show the usefulness of our proposed method.
    Page 1, “Introduction”
  7. Many studies have proposed methods to improve the accuracy of text classification.
    Page 1, “Related studies”
  8. For text classification, there are many studies which use the PageRank algorithm.
    Page 2, “Related studies”
  9. (2005) have introduced association rule mining to decide important words for text classification.
    Page 2, “Related studies”
  10. have used a PageRank-style algorithm to rank words and shown that their method is useful for text classification.
    Page 2, “Related studies”
  11. Weighting words is an important issue not only for text classification but also for text summarization; Erkan et al.
    Page 2, “Related studies”
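
The sentences above repeatedly contrast PageRank with tf.idf as a way to pick important words. Here is a hedged sketch of such a tf.idf baseline using scikit-learn with default settings; the paper does not say how its tf.idf weights were computed.

```python
# Sketch: pick each document's top-k words by tf.idf weight (baseline).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_top_words(documents, top_k=10):
    """documents: list of raw text strings."""
    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(documents)   # shape: (n_docs, n_vocab)
    vocab = np.array(vectorizer.get_feature_names_out())
    top = []
    for row in weights:
        # Sort this document's vocabulary by descending tf.idf weight.
        order = np.argsort(row.toarray().ravel())[::-1]
        top.append(vocab[order[:top_k]].tolist())
    return top
```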


co-occurrence

Appears in 8 sentences as: co-occurrence (8)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. In our study, we express the relation of word co-occurrence in the form of a graph.
    Page 1, “Introduction”
  2. (2011) detected topics in a document by constructing a graph of word co-occurrence and applying the PageRank algorithm to it.
    Page 2, “Related studies”
  3. The graph used in our method is constructed based on word co-occurrence so that important words which are sensitive to latent information can be extracted by the PageRank algorithm.
    Page 2, “Related studies”
  4. According to Newman et al. (2010), topic coherence is related to word co-occurrence.
    Page 2, “Techniques for text classification”
  5. The refined documents are composed of the important sentences extracted from the viewpoint of latent information, i.e., word co-occurrence, so they are well suited to classification based on latent information.
    Page 2, “Techniques for text classification”
  6. In our study, we construct a graph based on word co-occurrence.
    Page 3, “Techniques for text classification”
  7. So, important words are selected as those which have centrality in terms of word co-occurrence.
    Page 3, “Techniques for text classification”
  8. This shows that constructing a graph based on word co-occurrence within each span of three sentences in a document works well for ranking important words, taking the context of each word into account.
    Page 5, “Experiment”
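
Sentence 8 above reports that word co-occurrence within spans of three sentences ranked words well. Below is a small sketch of producing such windows; the paper does not state whether the spans slide or are disjoint, so a sliding window is assumed here. Its output could feed the PageRank sketch given earlier.

```python
# Sketch: sliding windows of three consecutive sentences per document;
# each window yields the tokens treated as co-occurring.
def three_sentence_windows(sentences):
    """sentences: list of token lists for one document."""
    for i in range(max(1, len(sentences) - 2)):
        yield [token for sent in sentences[i:i + 3] for token in sent]
```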


probabilistic distribution

Appears in 7 sentences as: probabilistic distribution (4) probabilistic distributions (2) probability distribution (3)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. After obtaining a collection of refined documents for classification, we adopt LDA to estimate the latent topic probability distributions over the target documents and use them for clustering.
    Page 3, “Techniques for text classification”
  2. In this study, we use the topic probability distribution over documents to make a topic vector for each document, and then calculate the similarity among documents.
    Page 3, “Techniques for text classification”
  3. Here, N is the number of all words in the target documents; w_mn is the n-th word in the m-th document; θ is the topic probability distribution for the documents; and φ is the word probability distribution for every topic.
    Page 3, “Techniques for text classification”
  4. Furthermore, the hyper-parameters for topic probability distribution and word probability distribution in LDA are α = 0.5 and β = 0.5, respectively.
    Page 4, “Experiment”
  5. Here, in the case of clustering the documents based on the topic probability distribution estimated by LDA, the topic distribution over documents θ changes with every estimation.
    Page 4, “Experiment”
  6. To measure the latent similarity among documents, we construct topic vectors from the topic probability distribution and adopt the Jensen-Shannon divergence to measure it; in the case of using document vectors, on the other hand, we adopt cosine similarity.
    Page 4, “Experiment”
  7. We think the reason is that refining the original documents retains only the important sentences representing their contents, which makes it easier to measure the differences between the topic probability distributions of documents.
    Page 5, “Experiment”
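
The sentences above compare documents by their topic distributions with the Jensen-Shannon divergence. A minimal sketch with scipy follows; note that scipy's jensenshannon returns the square root of the divergence (a distance), so it is squared here. The example vectors are hypothetical.

```python
# Sketch: Jensen-Shannon divergence between two topic distributions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(theta_a, theta_b):
    """theta_a, theta_b: topic probability vectors that each sum to 1."""
    return jensenshannon(theta_a, theta_b, base=2) ** 2

# Hypothetical 3-topic distributions for two documents.
print(js_divergence(np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.3, 0.6])))
```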


Cosine similarity

Appears in 5 sentences as: Cosine similarity (3) cosine similarity (2)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. To measure the latent similarity among documents, we construct topic vectors from the topic probability distribution and adopt the Jensen-Shannon divergence to measure it; in the case of using document vectors, on the other hand, we adopt cosine similarity.
    Page 4, “Experiment”
  2. Table 1: Extracting important sentences
       Method    Measure            Accuracy  F-value
       PageRank  Jensen-Shannon     0.567     0.485
       PageRank  Cosine similarity  0.287     0.291
       tf.idf    Jensen-Shannon     0.550     0.435
       tf.idf    Cosine similarity  0.275     0.270
    Page 4, “Experiment”
  3. Table 2: Without extracting important sentences
       Similarity measure  Accuracy  F-value
       Jensen-Shannon      0.518     0.426
       Cosine similarity   0.288     0.305
    Page 4, “Experiment”
  4. The accuracy in the case of using cosine similarity for clustering is low because the observed range of similarities between documents was narrow; as a result, documents from different categories were not identified well.
    Page 5, “Experiment”
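
For reference, here is the cosine similarity used in the document-vector baseline; as sentence 4 notes, a narrow spread of these values makes differently categorized documents hard to separate.

```python
# Sketch: cosine similarity between two document vectors.
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```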


LDA

Appears in 5 sentences as: LDA (5)
In Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm
  1. 3) Information used for classification — we use latent information estimated by latent Dirichlet allocation (LDA) (Blei et al., 2003) to classify documents, and compare the results of the cases using both surface and latent information.
    Page 1, “Introduction”
  2. After obtaining a collection of refined documents for classification, we adopt LDA to estimate the latent topic probability distributions over the target documents and use them for clustering.
    Page 3, “Techniques for text classification”
  3. As for the refined document obtained in step 2, the latent topics are estimated by means of LDA.
    Page 3, “Techniques for text classification”
  4. Furthermore, the hyper-parameters for topic probability distribution and word probability distribution in LDA are α = 0.5 and β = 0.5, respectively.
    Page 4, “Experiment”
  5. Here, in the case of clustering the documents based on the topic probability distribution estimated by LDA, the topic distribution over documents θ changes with every estimation.
    Page 4, “Experiment”
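
Sentence 4 above fixes the LDA hyper-parameters at α = 0.5 and β = 0.5. Below is a hedged sketch with gensim, where the β prior is named eta; the topic count and the other settings are assumptions, not values from the paper.

```python
# Sketch: estimate per-document topic distributions (theta) with LDA,
# using symmetric priors alpha = beta = 0.5; num_topics is assumed.
from gensim import corpora
from gensim.models import LdaModel

def document_topic_vectors(tokenized_docs, num_topics=20):
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(bows, id2word=dictionary, num_topics=num_topics,
                   alpha=0.5, eta=0.5, random_state=0)
    # One topic probability vector (theta) per document.
    return [[prob for _, prob in
             lda.get_document_topics(bow, minimum_probability=0.0)]
            for bow in bows]
```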
