Generating Synthetic Comparable Questions for News Articles
Rokhlenko, Oleg and Szpektor, Idan

Article Structure

Abstract

We introduce the novel task of automatically generating questions that are relevant to a text but do not appear in it.

Introduction

For companies whose revenues are mainly ad-based, e.g., Facebook, Google, and Yahoo, increasing user engagement is an important goal, leading to more time spent on site and consequently to increased exposure to ads.

Motivation and Algorithmic Overview

Before we detail our algorithm, we provide some motivation for and insight into the design choices we made, which also indicate the difficulties inherent in the task.

Comparable Question Mining

To suggest comparable questions, our algorithm needs a database of question templates.

Online Question Generation

The online part of our automatic generation algorithm takes as input a news article and generates concrete comparable questions for it.

Evaluation

5.1 Experimental Settings

Related Work

Traditionally, question generation focuses on converting assertions in a text into question forms (Brown et al., 2005; Mitkov et al., 2006; Myller, 2007; Heilman and Smith, 2010; Rus et al., 2010; Agarwal et al., 2011; Olney et al., 2012).

Conclusions

We introduced the novel task of automatically generating synthetic comparable questions that are relevant to a given news article but do not necessarily appear in it.

Topics

named entities

Appears in 22 sentences as: named entities (19), named entity (4)
In Generating Synthetic Comparable Questions for News Articles
  1. Looking at the structure of comparable questions, we observed that a specific comparable relation, such as ‘better dad’ and ‘faster’, can usually be combined with named entities in several syntactic ways to construct a concrete question.
    Page 2, “Motivation and Algorithmic Overview”
  2. As a preprocessing step for detecting comparable relations, our extraction algorithm identifies all the named entities of interest in our corpus, keeping only questions that contain at least two entities.
    Page 3, “Comparable Question Mining”
  3. While the corpus contains many other questions with two named entities, e.g. “Is #1 dating #2?”, our CRF tagger is trained to detect only comparable relations like “Who is prettier #1 or #2?”.
    Page 3, “Comparable Question Mining”
  4. Input: A news article. Output: A sorted list of comparable questions.
    1: Identify all target named entities (NEs) in the article
    2: Infer the distribution of LDA topics for the article
    3: For each comparable relation R in the database, compute its relevance score as the similarity between the topic distributions of R and the article
    4: Rank all the relations by their relevance score and pick the top M as relevant
    5: For each relevant relation R, in order of relevance ranking:
    6:   Filter out all the target NEs that do not pass the single-entity classifier for R
    7:   Generate all possible NE pairs from those that passed the single-entity classifier
    8:   Filter out all the generated NE pairs that do not pass the entity-pair classifier for R
    9:   Pick the top N pairs with positive classification score as qualified for generation
    Page 4, “Comparable Question Mining”
  5. For each relevant relation, we then generate concrete questions by picking generic templates that are applicable for this relation and instantiating them with pairs of named entities appearing in the article.
    Page 4, “Online Question Generation”
  6. To this end, we utilize two different broad-scale sources of information about named entities.
    Page 5, “Online Question Generation”
  7. The first is DBPedia, which contains structured information on Wikipedia entries, many of which are named entities that appear in news articles.
    Page 5, “Online Question Generation”
  8. For named entities with a DBPedia entry, we extract all the DBPedia properties of classes subject and type as indicator features.
    Page 5, “Online Question Generation”
  9. The majority gender among these pronouns is then chosen as the gender of the named entity, or none if the histogram is empty.
    Page 5, “Online Question Generation”
  10. For each named entity, we construct a histogram of the number of questions containing it that are assigned to each category.
    Page 5, “Online Question Generation”
  11. This histogram is normalized into a probability distribution with Laplace smoothing of 0.03, to account for the uncertainty in named entities that appear only a few times (see the sketch after this list).
    Page 5, “Online Question Generation”
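
Sentences 9-11 above describe two per-entity statistics: a majority vote over pronoun genders, and a per-entity histogram of question categories normalized with Laplace smoothing of 0.03. The following is a minimal sketch of both computations; the input counters and the exact additive-smoothing normalization are our assumptions, as the quoted sentences do not spell them out.

    from collections import Counter

    def majority_gender(pronoun_counts: Counter):
        # Pick the most frequent pronoun gender; None for an empty histogram,
        # matching "none if the histogram is empty" in sentence 9.
        if not pronoun_counts:
            return None
        return pronoun_counts.most_common(1)[0][0]

    def smoothed_distribution(category_counts: Counter, categories: list,
                              alpha: float = 0.03) -> dict:
        # Normalize a category histogram into a probability distribution with
        # additive (Laplace) smoothing, so entities seen only a few times do
        # not receive overconfident zero/one probabilities (sentences 10-11).
        total = sum(category_counts.values()) + alpha * len(categories)
        return {c: (category_counts.get(c, 0) + alpha) / total for c in categories}

    # Hypothetical usage:
    print(majority_gender(Counter({"female": 5, "male": 1})))          # female
    print(smoothed_distribution(Counter({"music": 3}), ["music", "film"]))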

news article

Appears in 18 sentences as: news article (11), news articles (7)
In Generating Synthetic Comparable Questions for News Articles
  1. One motivating example of its application is increasing user engagement around news articles by suggesting relevant comparable questions, such as “Is Beyonce a better singer than Madonna?”, for the user to answer.
    Page 1, “Abstract”
  2. In this paper we propose a new way to increase user engagement around news articles , namely suggesting questions for the user to answer, which are related to the viewed article.
    Page 1, “Introduction”
  3. Sadly, fun and engaging comparative questions are typically not found within the text of news articles .
    Page 1, “Introduction”
  4. However, it is highly unlikely that such sources will contain enough relevant questions for any news article due to typical sparseness issues as well as differences in interests between askers in CQA sites and news reporters.
    Page 1, “Introduction”
  5. To better address the motivating application above, we propose the novel task of automatically suggesting comparative questions that are relevant to a given input news article but do not appear in it.
    Page 1, “Introduction”
  6. For a given news article, an online part chooses relevant templates
    Page 1, “Introduction”
  7. Figure 1: An example news article from OMG!
    Page 2, “Introduction”
  8. To test the performance of our algorithm, we conducted a Mechanical Turk experiment that assessed the quality of suggested questions for news articles on celebrities.
    Page 2, “Introduction”
  9. Given a news article , our algorithm generates a set of comparable questions for the article from question templates, e.g.
    Page 2, “Motivation and Algorithmic Overview”
  10. Input: A news article. Output: A sorted list of comparable questions (a code sketch of these steps follows this list).
    1: Identify all target named entities (NEs) in the article
    2: Infer the distribution of LDA topics for the article
    3: For each comparable relation R in the database, compute its relevance score as the similarity between the topic distributions of R and the article
    4: Rank all the relations by their relevance score and pick the top M as relevant
    5: For each relevant relation R, in order of relevance ranking:
    6:   Filter out all the target NEs that do not pass the single-entity classifier for R
    7:   Generate all possible NE pairs from those that passed the single-entity classifier
    8:   Filter out all the generated NE pairs that do not pass the entity-pair classifier for R
    9:   Pick the top N pairs with positive classification score as qualified for generation
    Page 4, “Comparable Question Mining”
  11. The online part of our automatic generation algorithm takes as input a news article and generates concrete comparable questions for it.
    Page 4, “Online Question Generation”
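
Sentence 10 above is the paper's online generation algorithm rendered inline. To make the control flow concrete, here is a compact sketch of steps 1-9; every helper name (find_named_entities, infer_lda_topics, topic_similarity, single_entity_ok, pair_score, instantiate) is a placeholder for a component described elsewhere in the paper, not the authors' actual API.

    from itertools import combinations

    def generate_questions(article, relations, M=10, N=5):
        entities = find_named_entities(article)        # step 1: NER over the article
        article_topics = infer_lda_topics(article)     # step 2: LDA inference

        # Steps 3-4: score every comparable relation against the article's
        # topic distribution and keep the top M as relevant.
        relevant = sorted(relations,
                          key=lambda R: topic_similarity(R.topics, article_topics),
                          reverse=True)[:M]

        questions = []
        for R in relevant:                                            # step 5
            ok = [e for e in entities if single_entity_ok(R, e)]      # step 6
            pairs = list(combinations(ok, 2))                         # step 7
            scored = [(pair_score(R, a, b), a, b) for a, b in pairs]  # step 8
            top = sorted([s for s in scored if s[0] > 0],
                         key=lambda s: s[0], reverse=True)[:N]        # step 9
            questions += [instantiate(R, a, b) for _, a, b in top]
        return questions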

CRF

Appears in 9 sentences as: CRF (11)
In Generating Synthetic Comparable Questions for News Articles
  1. Therefore, we decided to apply a Conditional Random Fields (CRF) tagger (Lafferty et al., 2001) to the task, since CRF was shown to be state-of-the-art for sequential relation extraction (Mooney and Bunescu, 2005; Culotta et al., 2006; Jindal and Liu, 2006).
    Page 3, “Comparable Question Mining”
  2. This transformation helps us design a simpler CRF than that of Jindal and Liu (2006), since our CRF utilizes the known positions of the target entities in the text.
    Page 3, “Comparable Question Mining”
  3. To train the CRF model, the authors manually tagged all comparable relation words in approximately 300 transformed questions in the filtered corpus (see the sketch after this list).
    Page 3, “Comparable Question Mining”
  4. The local and global features for the CRF, which we induce from each question word, are specified in Figures 3 and 4 respectively.
    Page 3, “Comparable Question Mining”
  5. While the corpus contains many other questions with two named entities, e.g. “Is #1 dating #2?”, our CRF tagger is trained to detect only comparable relations like “Who is prettier #1 or #2?”.
    Page 3, “Comparable Question Mining”
  6. The authors conducted a manual evaluation of the CRF tagger’s performance, which showed 80% precision per occurrence.
    Page 3, “Comparable Question Mining”
  7. Figure 3: CRF local features for each word
    Page 4, “Comparable Question Mining”
  8. Figure 4: CRF global features for each word
    Page 4, “Comparable Question Mining”
  9. Our extraction of comparable relations falls within the field of Relation Extraction, in which CRF is a state-of-the-art method (Mooney and Bunescu, 2005; Culotta et al., 2006).
    Page 8, “Related Work”
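
The quoted sentences describe tagging comparable-relation words in entity-normalized questions (“#1”, “#2”) with a CRF. Below is a minimal training sketch using the sklearn-crfsuite package; the features and toy data are illustrative stand-ins, since the paper's actual local and global features (Figures 3 and 4) are not reproduced on this page.

    import sklearn_crfsuite

    def word_features(tokens, i):
        # Illustrative features only; the paper's features also exploit the
        # known positions of the target entities in the question.
        w = tokens[i]
        return {
            "word.lower": w.lower(),
            "is_entity_slot": w in ("#1", "#2"),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        }

    def featurize(tokens):
        return [word_features(tokens, i) for i in range(len(tokens))]

    # Toy training data with per-word labels marking relation words (REL).
    # "dating" stays untagged, mirroring the restriction to comparable
    # relations rather than arbitrary two-entity relations.
    train = [
        ("Who is prettier #1 or #2 ?".split(), "O O REL O O O O".split()),
        ("Is #1 a better dad than #2 ?".split(), "O O O REL REL O O O".split()),
        ("Is #1 dating #2 ?".split(), "O O O O O".split()),
    ]
    X = [featurize(toks) for toks, _ in train]
    y = [labels for _, labels in train]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict([featurize("Who is funnier #1 or #2 ?".split())]))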

LDA

Appears in 7 sentences as: LDA (8)
In Generating Synthetic Comparable Questions for News Articles
  1. Input: A news article. Output: A sorted list of comparable questions.
    1: Identify all target named entities (NEs) in the article
    2: Infer the distribution of LDA topics for the article
    3: For each comparable relation R in the database, compute its relevance score as the similarity between the topic distributions of R and the article
    4: Rank all the relations by their relevance score and pick the top M as relevant
    5: For each relevant relation R, in order of relevance ranking:
    6:   Filter out all the target NEs that do not pass the single-entity classifier for R
    7:   Generate all possible NE pairs from those that passed the single-entity classifier
    8:   Filter out all the generated NE pairs that do not pass the entity-pair classifier for R
    9:   Pick the top N pairs with positive classification score as qualified for generation
    Page 4, “Comparable Question Mining”
  2. Specifically, we utilize Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to infer latent topics in texts (see the sketch after this list).
    Page 4, “Online Question Generation”
  3. To train an LDA model, we constructed for each comparable relation a pseudo-document consisting of all questions that contain this relation in our corpus (the supporting questions).
    Page 4, “Online Question Generation”
  4. An additional product of the LDA training process is a topic distribution for each relation’s pseudo-document, which we consider as the relation’s context profile.
    Page 5, “Online Question Generation”
  5. Given a news article, a distribution over LDA topics is inferred from the article’s text using the trained model.
    Page 5, “Online Question Generation”
  6. The reason for this mistake is that many named entities appear as frequent terms in LDA topics, and thus mentioning many names that belong to a single topic drives LDA to assign this topic a high probability.
    Page 8, “Evaluation”
  7. Instead, we are interested in a higher-level topical similarity to the input article, for which LDA topics were shown to help (Celikyilmaz et al., 2010).
    Page 8, “Related Work”
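
Sentences 2-5 describe the relevance model: train LDA where each comparable relation contributes one pseudo-document (its supporting questions), store each relation's topic distribution as its context profile, and match articles against those profiles. A sketch with gensim follows; the toy data and the choice of cosine similarity are our assumptions, since the quoted sentences do not name the similarity measure.

    import numpy as np
    from gensim import corpora, models

    # Toy pseudo-documents: one per relation, built from supporting questions.
    supporting = {
        "better singer": ["who is a better singer #1 or #2",
                          "is #1 a better singer than #2"],
        "faster": ["who is faster #1 or #2", "is #1 faster than #2"],
    }
    docs = {rel: " ".join(qs).split() for rel, qs in supporting.items()}
    dictionary = corpora.Dictionary(docs.values())
    bows = {rel: dictionary.doc2bow(d) for rel, d in docs.items()}

    NUM_TOPICS = 5
    lda = models.LdaModel(list(bows.values()), id2word=dictionary,
                          num_topics=NUM_TOPICS)

    def dense(topic_pairs):
        # gensim returns sparse (topic_id, prob) pairs; expand to a vector.
        v = np.zeros(NUM_TOPICS)
        for t, p in topic_pairs:
            v[t] = p
        return v

    # Each relation's context profile = topic distribution of its pseudo-document.
    profiles = {rel: dense(lda.get_document_topics(b, minimum_probability=0.0))
                for rel, b in bows.items()}

    def rank_relations(article_tokens, top_m=10):
        a = dense(lda.get_document_topics(dictionary.doc2bow(article_tokens),
                                          minimum_probability=0.0))
        sim = {rel: float(np.dot(p, a) /
                          (np.linalg.norm(p) * np.linalg.norm(a) + 1e-9))
               for rel, p in profiles.items()}
        return sorted(sim, key=sim.get, reverse=True)[:top_m]

    print(rank_relations("which singer has a better voice".split()))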

feature vector

Appears in 4 sentences as: feature vector (2), feature vectors (2)
In Generating Synthetic Comparable Questions for News Articles
  1. We next describe the various features we extract for every entity and the supervised models that, given this feature vector representation, assess the correctness of an instantiation.
    Page 5, “Online Question Generation”
  2. The feature vector of each named entity was induced as described in Section 4.2.1.
    Page 6, “Online Question Generation”
  3. To generate features for a candidate pair, we take the feature vectors of the two entities and induce families of pair features by comparing the two vectors (see the sketch after this list).
    Page 6, “Online Question Generation”
  4. Figure 5: The entity pair features generated from two single-entity feature vectors fa and fb
    Page 6, “Online Question Generation”
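
Sentence 3 and Figure 5 describe inducing entity-pair features by comparing the two single-entity vectors fa and fb. The exact feature families are not shown on this page, so the three families below (shared indicators, left-only, right-only) are an illustrative assumption, not the paper's Figure 5.

    def pair_features(fa: dict, fb: dict) -> dict:
        # Compare two single-entity indicator vectors and emit pair features.
        keys_a, keys_b = set(fa), set(fb)
        feats = {}
        for k in keys_a & keys_b:
            feats["both:" + k] = 1.0      # indicator both entities share
        for k in keys_a - keys_b:
            feats["only_a:" + k] = 1.0    # indicator of the first entity only
        for k in keys_b - keys_a:
            feats["only_b:" + k] = 1.0    # indicator of the second entity only
        return feats

    # Hypothetical DBPedia-derived indicator vectors:
    fa = {"type:Singer": 1.0, "gender:female": 1.0}
    fb = {"type:Singer": 1.0, "type:Actor": 1.0, "gender:female": 1.0}
    print(sorted(pair_features(fa, fb)))
    # ['both:gender:female', 'both:type:Singer', 'only_b:type:Actor']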

Mechanical Turk

Appears in 4 sentences as: Mechanical Turk (4)
In Generating Synthetic Comparable Questions for News Articles
  1. We tested the suggestions generated by our algorithm via a Mechanical Turk experiment, which showed an improvement of more than 45% over the strongest baseline on all metrics.
    Page 1, “Abstract”
  2. To test the performance of our algorithm, we conducted a Mechanical Turk experiment that assessed the quality of suggested questions for news articles on celebrities.
    Page 2, “Introduction”
  3. To evaluate our algorithm’s performance, we designed a Mechanical Turk (MTurk) experiment in which human annotators assess the quality of the questions that our algorithm generates for a sample of news articles.
    Page 6, “Evaluation”
  4. We assessed the performance of our algorithm via a Mechanical Turk experiment.
    Page 9, “Conclusions”

Relation Extraction

Appears in 4 sentences as: Relation Extraction (2), relation extraction (2)
In Generating Synthetic Comparable Questions for News Articles
  1. 3.1 Comparable Relation Extraction
    Page 3, “Comparable Question Mining”
  2. An important observation for the task of comparable relation extraction is that many relations are complex multiword expressions, and thus their automatic detection is not trivial.
    Page 3, “Comparable Question Mining”
  3. Therefore, we decided to apply a Conditional Random Fields (CRF) tagger (Lafferty et al., 2001) to the task, since CRF was shown to be state-of-the-art for sequential relation extraction (Mooney and Bunescu, 2005; Culotta et al., 2006; Jindal and Liu, 2006).
    Page 3, “Comparable Question Mining”
  4. Our extraction of comparable relations falls within the field of Relation Extraction, in which CRF is a state-of-the-art method (Mooney and Bunescu, 2005; Culotta et al., 2006).
    Page 8, “Related Work”
