Topic Modeling Based Classification of Clinical Reports
Sarioglu, Efsun and Yadav, Kabir and Choi, Hyeong-Ah

Article Structure

Abstract

Electronic health records (EHRs) contain important clinical information about patients.

Introduction

Large amounts of medical data are now stored as electronic health records (EHRs).

Background

This study utilized prospective patient data previously collected for a traumatic orbital fracture project (Yadav et al., 2012).

Related Work

For text classification, topic modeling techniques have been utilized in various ways.

Experiments

Figure 1 shows the three approaches to classifying clinical reports using their topic models; each is explained below.

Results

Classification results using ATC and SVM are shown in Figures 2, 3, and 4 for precision, recall, and f-score respectively.

Conclusion

In this study, topic modeling of clinical reports is utilized in different ways with the end goal of classification.

Topics

topic model

Appears in 23 sentences as: topic model (8) Topic Modeling (3) Topic modeling (5) topic modeling (5) topic models (3)
In Topic Modeling Based Classification of Clinical Reports
  1. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways.
    Page 1, “Abstract”
  2. Topic modeling of the corpora provides interpretable themes that exist in these reports.
    Page 1, “Abstract”
  3. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class.
    Page 1, “Abstract”
  4. In this study, we developed several topic modeling based classification systems for clinical reports.
    Page 1, “Introduction”
  5. Topic modeling is an unsupervised technique that can automatically identify themes from a given set of documents and find topic distributions of each document.
    Page 1, “Introduction”
  6. Therefore, topic model output of patient reports could contain very useful clinical information.
    Page 1, “Introduction”
  7. 2.2 Topic Modeling
    Page 2, “Background”
  8. Topic modeling is an unsupervised learning algorithm that can automatically discover themes of a document collection.
    Page 2, “Background”
  9. Either sampling methods such as Gibbs Sampling (Griffiths and Steyvers, 2004) or optimization methods such as variational Bayes approximation (Asuncion et al., 2009) can be used to train a topic model based on LDA.
    Page 2, “Background”
  10. For text classification, topic modeling techniques have been utilized in various ways.
    Page 2, “Related Work”
  11. In our study, we removed the most frequent and infrequent words to have a manageable vocabulary size but we did not utilize topic model output for this purpose.
    Page 2, “Related Work”
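
The binary topic model in item 3 can be illustrated with a minimal sketch. It assumes a two-topic model has already been trained and that each report comes with its two-element topic distribution. The function name classify_by_dominant_topic, the topic_to_class mapping, and the sample numbers are hypothetical and are not taken from the paper. A topic model does not say which topic corresponds to which class, so the mapping is passed in explicitly.

```python
from typing import Dict, List, Sequence


def classify_by_dominant_topic(
    doc_topic: Sequence[Sequence[float]],
    topic_to_class: Dict[int, str],
) -> List[str]:
    """Assign each document the class mapped to its highest-probability topic."""
    labels = []
    for dist in doc_topic:
        # Index of the dominant topic; ties resolve to the lower index.
        dominant = max(range(len(dist)), key=lambda k: dist[k])
        labels.append(topic_to_class[dominant])
    return labels


# Illustrative (made-up) topic distributions for three reports.
example_doc_topic = [[0.82, 0.18], [0.35, 0.65], [0.51, 0.49]]
example_mapping = {0: "negative", 1: "positive"}  # assumed topic-to-class mapping
example_labels = classify_by_dominant_topic(example_doc_topic, example_mapping)
```

In practice the mapping would have to be chosen by inspecting the top words of each topic or a small labeled sample.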

See all papers in Proc. ACL 2013 that mention topic model.


SVM

Appears in 14 sentences as: SVM (14)
In Topic Modeling Based Classification of Clinical Reports
  1. The support vector machine (SVM) is a popular classification algorithm that attempts to find the decision boundary between classes that is farthest from any point in the training dataset.
    Page 2, “Background”
  2. Given labeled training data $(x_t, y_t)$, $t = 1, \dots, N$, where $x_t \in \mathbb{R}^M$ and $y_t \in \{1, -1\}$, SVM tries to find a separating hyperplane with the maximum margin (Platt, 1998).
    Page 2, “Background”
  3. tor classification results with SVM; however, (Sriurai, 2011) uses a fixed number of topics, whereas we evaluated different numbers of topics since typically this is not known in advance.
    Page 3, “Related Work”
  4. SVM was chosen as the classification algorithm because it has been shown to perform well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and is robust to overfitting (Sebastiani, 2002).
    Page 3, “Experiments”
  5. Accordingly, the raw text of the reports and the topic vectors are compiled into individual files with their corresponding outcomes in ARFF format and then classified with SVM.
    Page 3, “Experiments”
  6. Classification results using ATC and SVM are shown in Figures 2, 3, and 4 for precision, recall, and f-score respectively.
    Page 4, “Results”
  7. The best classification performance was achieved with 15 topics for ATC and 100 topics for SVM.
    Page 4, “Results”
  8. For smaller numbers of topics, ATC performed better than SVM.
    Page 4, “Results”
  9. However, representing reports as topic vectors still provided a large dimensionality reduction, since the raw text of the reports had 1,296 terms, and made the subsequent classification with SVM faster.
    Page 4, “Results”
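
The maximum-margin idea in items 1 and 2 is commonly written as the hard-margin optimization problem below, for training pairs $(x_t, y_t)$ with $x_t \in \mathbb{R}^M$ and $y_t \in \{1, -1\}$. The weight vector $w$ and bias $b$ are notation introduced here for illustration. This is the standard textbook form, not necessarily the exact formulation the authors solved; soft-margin variants add slack variables.

```latex
\min_{w \in \mathbb{R}^M,\; b \in \mathbb{R}} \ \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_t \,(w^\top x_t + b) \ge 1, \qquad t = 1, \dots, N
```

The separating hyperplane is $\{x : w^\top x + b = 0\}$ and the margin between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.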


LDA

Appears in 6 sentences as: LDA (6)
In Topic Modeling Based Classification of Clinical Reports
  1. Several techniques can be used for this purpose, such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999), and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
    Page 2, “Background”
  2. LDA, first defined by Blei et al. (2003), defines a topic as a distribution over a fixed vocabulary, where each document can exhibit the topics in different proportions.
    Page 2, “Background”
  3. For each document, LDA generates the words in a two-step process:
    Page 2, “Background”
  4. Either sampling methods such as Gibbs Sampling (Griffiths and Steyvers, 2004) or optimization methods such as variational Bayes approximation (Asuncion et al., 2009) can be used to train a topic model based on LDA.
    Page 2, “Background”
  5. LDA performs better than PLSA for small datasets since it avoids overfitting and it supports polysemy (Blei et al., 2003).
    Page 2, “Background”
  6. LDA was chosen to generate the topic models of clinical reports due to its being a generative probabilistic system for documents and its robustness to overfitting.
    Page 3, “Experiments”
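
The excerpt in item 3 stops before listing the two steps. The sketch here follows the usual textbook description of LDA's generative process (first draw a topic distribution for the document, then for each word draw a topic and a word from that topic), which may be worded differently in the paper. All names, the toy vocabulary, and the parameter values are hypothetical, and the sketch only generates text from fixed topics; it does not train a model.

```python
import random
from typing import List, Sequence


def sample_dirichlet(alpha: Sequence[float], rng: random.Random) -> List[float]:
    """Draw from a Dirichlet distribution by normalizing Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(
    topics: Sequence[Sequence[float]],  # topics[k][v] = P(word v | topic k)
    vocabulary: Sequence[str],
    alpha: Sequence[float],             # Dirichlet prior over topics
    length: int,
    rng: random.Random,
) -> List[str]:
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Step 2a: choose a topic for this word position.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: choose a word from that topic's vocabulary distribution.
        words.append(rng.choices(vocabulary, weights=topics[k])[0])
    return words


# Toy setup (made-up values): two topics over a four-word vocabulary.
toy_vocabulary = ["fracture", "orbit", "normal", "sinus"]
toy_topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
toy_document = generate_document(
    toy_topics, toy_vocabulary, alpha=[1.0, 1.0], length=20, rng=random.Random(0)
)
```

Training a topic model amounts to inverting this process: given only the words, infer the topics and each document's topic distribution.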


text classification

Appears in 6 sentences as: Text Classification (1) Text classification (1) text classification (4)
In Topic Modeling Based Classification of Clinical Reports
  1. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways.
    Page 1, “Abstract”
  2. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.
    Page 1, “Abstract”
  3. 2.3 Text Classification
    Page 2, “Background”
  4. Text classification is a supervised learning algorithm where documents’ categories are learned from a pre-labeled set of documents.
    Page 2, “Background”
  5. For text classification, topic modeling techniques have been utilized in various ways.
    Page 2, “Related Work”
  6. SVM was chosen as the classification algorithm because it has been shown to perform well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and is robust to overfitting (Sebastiani, 2002).
    Page 3, “Experiments”


topic distributions

Appears in 5 sentences as: topic distribution (1) topic distributions (4)
In Topic Modeling Based Classification of Clinical Reports
  1. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes.
    Page 1, “Abstract”
  2. Topic modeling is an unsupervised technique that can automatically identify themes from a given set of documents and find topic distributions of each document.
    Page 1, “Introduction”
  3. Representing reports according to their topic distributions is more compact and can be processed faster than raw text in subsequent automated processing.
    Page 1, “Introduction”
  4. Topic modeling of reports produces a topic distribution for each report which can be used to represent them as topic vectors.
    Page 3, “Experiments”
  5. With this approach, a representative topic vector for each class was composed by averaging their corresponding topic distributions in the training dataset.
    Page 3, “Experiments”
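
The class-averaging step in item 5 can be sketched as follows. The averaging itself comes from the excerpt; assigning a new report to the class whose averaged vector is most similar by cosine similarity is an assumption made here for illustration, as are all function names and sample values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def class_centroids(
    vectors: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average the topic vectors of the training reports in each class."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_class(vec: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    # Assumption: pick the class whose averaged vector is most similar.
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Made-up two-topic training vectors with their outcomes.
train_vectors = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7]]
train_labels = ["positive", "positive", "negative", "negative"]
centroids = class_centroids(train_vectors, train_labels)
predicted = nearest_class([0.75, 0.25], centroids)
```

Because each class is summarized by a single vector, prediction costs one similarity computation per class regardless of training set size.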


bag-of-words

Appears in 4 sentences as: Bag-of-Words (1) bag-of-words (3)
In Topic Modeling Based Classification of Clinical Reports
  1. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes.
    Page 1, “Abstract”
  2. 2.1 Bag-of-Words (BOW) Representation
    Page 1, “Background”
  3. One way of doing this is bag-of-words (BoW) representation where each document becomes a vector of its words/tokens.
    Page 1, “Background”
  4. Firstly, bag-of-words representation is replaced with topic vectors which provide good dimensionality reduction and still get comparable classification performance.
    Page 6, “Conclusion”
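
As a small illustration of the bag-of-words representation described in item 3, the sketch here turns each document into a vector of term counts over a shared vocabulary. Lowercasing, whitespace tokenization, and raw counts are simplifying assumptions; the excerpt does not say which tokenizer or term weighting the authors used, and the sample sentences are invented.

```python
from collections import Counter
from typing import List, Sequence


def build_vocabulary(documents: Sequence[str]) -> List[str]:
    """Collect the sorted set of lowercase whitespace-separated tokens."""
    return sorted({tok for doc in documents for tok in doc.lower().split()})


def bag_of_words(document: str, vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Invented example sentences, not taken from the study's reports.
sample_reports = [
    "no fracture seen",
    "fracture of the orbital floor",
    "no acute fracture no swelling",
]
vocabulary = build_vocabulary(sample_reports)
bow_vectors = [bag_of_words(report, vocabulary) for report in sample_reports]
```

The vector length equals the vocabulary size, which is why replacing it with a much shorter topic vector reduces dimensionality.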


overfitting

Appears in 4 sentences as: overfitting (4)
In Topic Modeling Based Classification of Clinical Reports
  1. PLSA solves the polysemy problem; however, it is not considered a fully generative model of documents and is known to overfit (Blei et al., 2003).
    Page 2, “Background”
  2. LDA performs better than PLSA for small datasets since it avoids overfitting and it supports polysemy (Blei et al., 2003).
    Page 2, “Background”
  3. LDA was chosen to generate the topic models of clinical reports due to its being a generative probabilistic system for documents and its robustness to overfitting.
    Page 3, “Experiments”
  4. SVM was chosen as the classification algorithm because it has been shown to perform well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and is robust to overfitting (Sebastiani, 2002).
    Page 3, “Experiments”


learning algorithm

Appears in 3 sentences as: learning algorithm (2) learning algorithms (1)
In Topic Modeling Based Classification of Clinical Reports
  1. Topic modeling is an unsupervised learning algorithm that can automatically discover themes of a document collection.
    Page 2, “Background”
  2. Text classification is a supervised learning algorithm where documents’ categories are learned from a pre-labeled set of documents.
    Page 2, “Background”
  3. Classification was conducted with Weka, a collection of machine learning algorithms for data mining tasks written in Java (Hall et al., 2009).
    Page 3, “Experiments”
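
Weka reads datasets in its ARFF text format, so feature vectors have to be serialized before classification. The following is a minimal sketch of writing topic vectors and their outcomes as ARFF. The relation name, the attribute names topic_0, topic_1, ... and outcome, the four-decimal formatting, and the sample values are all hypothetical; the excerpt does not describe the authors' actual file layout.

```python
from typing import Sequence


def to_arff(
    relation: str,
    vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
    class_values: Sequence[str],
) -> str:
    """Serialize numeric feature vectors with a nominal class as ARFF text."""
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per topic dimension.
    for k in range(len(vectors[0])):
        lines.append(f"@attribute topic_{k} numeric")
    # Nominal class attribute listing the allowed outcome values.
    lines.append("@attribute outcome {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for vec, label in zip(vectors, labels):
        lines.append(",".join(f"{v:.4f}" for v in vec) + f",{label}")
    return "\n".join(lines) + "\n"


# Made-up topic vectors for two reports.
arff_text = to_arff(
    "reports",
    [[0.8, 0.2], [0.1, 0.9]],
    ["positive", "negative"],
    ["positive", "negative"],
)
```

The resulting text can be saved with a .arff extension and opened in Weka, where the last attribute would typically be selected as the class.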
