Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
Peter Jansen, Mihai Surdeanu, and Peter Clark

Article Structure

Abstract

We propose a robust answer reranking model for non-factoid questions that integrates lexical semantics with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers, and a deep one based on Rhetorical Structure Theory.

Introduction

Driven by several international evaluations and workshops such as the Text REtrieval Conference (TREC) and the Cross Language Evaluation Forum (CLEF), the task of question answering (QA) has received considerable attention.

Related Work

The body of work on factoid QA is too broad to be discussed here (see, e.g., the TREC workshops for an overview).

Approach

The proposed answer reranking component is embedded in the QA framework illustrated in Figure 1.

Models and Features

We propose two separate discourse representation schemes — one shallow, centered around discourse markers, and one deep, based on RST.

Experiments

5.1 Data

CR + LS + DMM + DPM: P@1 39.32* (+24%), MRR 47.86* (+20%)

Table 1: Overall results across three datasets.

Topics

reranking

Appears in 24 sentences as: rerank (2) reranker (4) rerankers (1) reranking (18)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. We propose a robust answer reranking model for non-factoid questions that integrates lexical semantics with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers, and a deep one based on Rhetorical Structure Theory.
    Page 1, “Abstract”
  2. We propose a novel answer reranking (AR) model that combines lexical semantics (LS) with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers and surface text information, and a deep one based on the Rhetorical Structure Theory (RST) discourse framework (Mann and Thompson, 1988).
    Page 1, “Introduction”
  3. First, most NF QA approaches tend to use multiple similarity models (information retrieval or alignment) as features in discriminative rerankers (Riezler et al., 2007; Higashinaka and Isozaki, 2008; Verberne et al., 2010; Surdeanu et al., 2011).
    Page 2, “Related Work”
  4. (2011) extracted 47 cue phrases such as because from a small collection of web documents, and used the cosine similarity between an answer candidate and a bag of words containing these cue phrases as a single feature in their reranking model for non-factoid why QA.
    Page 2, “Related Work”
  5. This classifier was then used to extract instances of causal relations in answer candidates, which were turned into features in a reranking model for Japanese why QA.
    Page 2, “Related Work”
  6. Figure 1: Architecture of the reranking framework for QA.
    Page 2, “Related Work”
  7. The proposed answer reranking component is embedded in the QA framework illustrated in Figure 1.
    Page 2, “Approach”
  8. CQA: In this scenario, the task is defined as reranking all the user-posted answers for a particular question to boost the community-selected best answer to the top position.
    Page 2, “Approach”
  9. These answer candidates are then passed to the answer reranking component, the focus of this work.
    Page 3, “Approach”
  10. AR analyzes the candidates using more expensive techniques to extract discourse and LS features (detailed in §4), and these features are then used in concert with a learning framework to rerank the candidates and elevate correct answers to higher positions.
    Page 3, “Approach”
  11. For the learning framework, we used SVMrank, a variant of Support Vector Machines for structured output adapted to ranking problems. In addition to these features, each reranker also includes a single feature containing the score of each candidate, as computed by the above candidate retrieval (CR) component (a minimal pairwise-ranking sketch follows this list).
    Page 3, “Approach”
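
Sentences 10 and 11 above outline the learning setup: each candidate is represented by discourse and lexical-semantic features plus the candidate retrieval (CR) score, and the candidates for a question are reranked with SVMrank. The page does not include code; the sketch below approximates that setup with a pairwise-transform ranking SVM in scikit-learn as a stand-in for SVMrank. The feature layout and function names are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: rerank answer candidates with a pairwise ranking SVM.
# This approximates SVM^rank with scikit-learn; it is not the authors' code,
# and the feature layout below is an assumption for illustration.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, y, groups):
    """Turn per-candidate features into pairwise difference vectors per question."""
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)
    Xp, yp = [], []
    for q in np.unique(groups):
        idx = np.where(groups == q)[0]
        for i, j in combinations(idx, 2):
            if y[i] == y[j]:
                continue  # no preference between equally labeled candidates
            diff, sign = X[i] - X[j], 1 if y[i] > y[j] else -1
            Xp.extend([diff, -diff])   # add both directions so both classes exist
            yp.extend([sign, -sign])
    return np.array(Xp), np.array(yp)

def train_reranker(X, y, groups):
    # X rows: [CR score, DMM features..., DPM features..., LS features...]
    # y: 1 for a correct answer, 0 otherwise; groups: question id per candidate
    Xp, yp = pairwise_transform(X, y, groups)
    model = LinearSVC(C=1.0)          # linear kernel, matching the paper's setting
    model.fit(Xp, yp)
    return model

def rerank(model, X_candidates):
    scores = np.asarray(X_candidates) @ model.coef_.ravel()  # linear scoring
    return np.argsort(-scores)        # candidate indices, best first
```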


cosine similarity

Appears in 12 sentences as: cosine similarities (1) cosine similarity (11)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. (2011) extracted 47 cue phrases such as because from a small collection of web documents, and used the cosine similarity between an answer candidate and a bag of words containing these cue phrases as a single feature in their reranking model for non-factoid why QA.
    Page 2, “Related Work”
  2. This is a commonly used setup in the CQA community (Wang et al., 2009). Thus, for a given question, all its answers are fetched from the answer collection, and an initial ranking is constructed based on the cosine similarity between their lemma vector representations and that of the question, with lemmas weighted using tf-idf (Ch.
    Page 2, “Approach”
  3. The candidate answers are scored using a linear interpolation of two cosine similarity scores: one between the entire parent document and the question (to model global context), and a second between the answer candidate and the question (for local context). Because the number of answer candidates is typically large (e.g., equal to the number of paragraphs in the textbook), we return the N top candidates with the highest scores (see the retrieval sketch after this list).
    Page 3, “Approach”
  4. We empirically observed that this combination of scores performs better than using solely the cosine similarity between the answer and question.
    Page 3, “Models and Features”
  5. If text before or after a marker out to a given sentence range matches the entire text of the question (with a cosine similarity score larger than a threshold), that argument takes on the label QSEG; otherwise it is labeled OTHER.
    Page 3, “Models and Features”
  6. The values of the discourse features are the mean of the similarity scores (e.g., cosine similarity using tf-idf weighting) between the two marker arguments and the corresponding question.
    Page 3, “Models and Features”
  7. For example, the value of the QSEG BY QSEG SR1 feature in Figure 2 is the average of the cosine similarities of the question text with the answer texts before and after the marker by, out to a distance of one sentence on each side of the marker.
    Page 3, “Models and Features”
  8. Similar to the DMM, these features take real values obtained by averaging the cosine similarity of the arguments with the question content.
    Page 4, “Models and Features”
  9. …candidate, which is computed as the cosine similarity between the two composite vectors of the question and the answer candidate.
    Page 5, “Models and Features”
  10. Both this overall similarity score and the average pairwise cosine similarity between each word in the question and each word in the answer candidate serve as features.
    Page 5, “Models and Features”
  11. The following hyperparameters were tuned using grid search to maximize P@1 on each development partition: (a) the segment matching thresholds that determine the minimum cosine similarity between an answer segment and a question for the segment to be labeled QSEG; and (b)
    Page 5, “Experiments”
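
Sentences 2 and 3 above describe the candidate retrieval (CR) scoring: candidates are ranked by tf-idf cosine similarity with the question and, in the retrieval scenario, the final score linearly interpolates a document-level (global) and an answer-level (local) similarity. Below is a hedged sketch using scikit-learn; the interpolation weight `lam`, the function name, and the use of raw tokens rather than lemmas are assumptions, not details from the paper.

```python
# Hedged sketch of the CR scoring described above: tf-idf cosine similarity,
# linearly interpolating a global (parent document) score and a local (answer)
# score. The weight `lam` and the use of raw tokens (not lemmas) are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_candidates(question, answers, parent_docs, n=20, lam=0.5):
    """Return indices of the top-n answer candidates.

    answers[i] is an answer candidate and parent_docs[i] its containing document.
    """
    vec = TfidfVectorizer()
    vec.fit(answers + parent_docs + [question])

    q = vec.transform([question])
    a = vec.transform(answers)
    d = vec.transform(parent_docs)

    local = cosine_similarity(a, q).ravel()    # answer vs. question
    global_ = cosine_similarity(d, q).ravel()  # parent document vs. question
    scores = lam * global_ + (1.0 - lam) * local

    return sorted(range(len(answers)), key=lambda i: -scores[i])[:n]
```

In the paper the vectors are built over lemmas; inserting a lemmatization step before vectorization would bring this sketch closer to that setup.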


lexical semantic

Appears in 12 sentences as: Lexical semantic (1) lexical semantic (5) Lexical Semantics (2) lexical semantics (5)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. We propose a robust answer reranking model for non-factoid questions that integrates lexical semantics with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers, and a deep one based on Rhetorical Structure Theory.
    Page 1, “Abstract”
  2. We experimentally demonstrate that the discourse structure of non-factoid answers provides information that is complementary to lexical semantic similarity between question and answer, improving performance up to 24% (relative) over a state-of-the-art model that exploits lexical semantic similarity alone.
    Page 1, “Abstract”
  3. We propose a novel answer reranking (AR) model that combines lexical semantics (LS) with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers and surface text information, and a deep one based on the Rhetorical Structure Theory (RST) discourse framework (Mann and Thompson, 1988).
    Page 1, “Introduction”
  4. Inspired by this previous work and recent work in discourse parsing (Feng and Hirst, 2012), our work is the first to systematically explore structured discourse features driven by several discourse representations, combine discourse with lexical semantic models, and evaluate these representations on thousands of questions using both in-domain and cross-domain experiments.
    Page 2, “Related Work”
  5. 4.3 Lexical Semantics Model
    Page 4, “Models and Features”
  6. (2013), we include lexical semantics in our reranking model.
    Page 4, “Models and Features”
  7. Lexical Semantics: We trained two different RNNLMs for this work.
    Page 5, “Experiments”
  8. Lexical semantic features increase performance for all settings, but demonstrate far more utility to the open-domain YA corpus.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  9. This disparity is likely due to the difficulty in assembling LS training data at an appropriate level for the biology corpus, contrasted with the relative abundance of large scale open-domain lexical semantic resources.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  10. For the YA corpus, where lexical semantics showed the most benefit, simply adding
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  11. Empirically we show that modeling answer discourse structures is complementary to modeling lexical semantic similarity and that the best performance is obtained when they are tightly integrated.
    Page 9, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”


discourse parser

Appears in 10 sentences as: Discourse Parser (2) discourse parser (4) discourse parsers (1) discourse parsing (4)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. In terms of discourse parsing, Verberne et al.
    Page 2, “Related Work”
  2. Discourse Parser (deep)
    Page 2, “Related Work”
  3. They later concluded that while discourse parsing appears to be useful for QA, automated discourse parsing tools are required before this approach can be tested at scale (Verberne et al., 2010).
    Page 2, “Related Work”
  4. Inspired by this previous work and recent work in discourse parsing (Feng and Hirst, 2012), our work is the first to systematically explore structured discourse features driven by several discourse representations, combine discourse with lexical semantic models, and evaluate these representations on thousands of questions using both in-domain and cross-domain experiments.
    Page 2, “Related Work”
  5. 4.2 Discourse Parser Model
    Page 4, “Models and Features”
  6. The discourse parser model (DPM) is based on the RST discourse framework (Mann and Thompson, 1988); a feature-extraction sketch follows this list.
    Page 4, “Models and Features”
  7. However, this also introduces noise because discourse analysis is a complex task and discourse parsers are not perfect.
    Page 4, “Models and Features”
  8. This suggests that, in the domains explored here, there is a degree of noise introduced by the discourse parser, and the simple features proposed here are the best strategy to avoid overfitting on it.
    Page 4, “Models and Features”
  9. Due to the speed limitations of the discourse parser, we randomly drew 10,000 QA pairs from the corpus of how questions described by Surdeanu et al.
    Page 5, “Experiments”
  10. This is a motivating result for discourse analysis, especially considering that the discourse parser was trained on a domain different from the corpora used here.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
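
Sentence 6 above introduces the discourse parser model (DPM); elsewhere this page describes its features as combining an RST relation's type with QSEG/OTHER labels for its two arguments, with values given by the mean question-argument similarity. The sketch below assumes a hypothetical `relations` list of (relation_type, arg1_text, arg2_text) tuples produced by some RST parser; it is not the authors' feature extractor, and the threshold value is illustrative.

```python
# Hedged sketch of DPM-style features: one feature per RST relation, labeled by
# the relation type plus QSEG/OTHER labels for its arguments, and valued by the
# mean question-argument cosine similarity. `relations` is a hypothetical
# flattening of an RST tree into (relation_type, arg1_text, arg2_text) tuples.
from collections import defaultdict

def dpm_features(question, relations, similarity, threshold=0.3):
    """similarity(a, b) -> cosine similarity of two texts (e.g., tf-idf based)."""
    feats = defaultdict(list)
    for rel_type, arg1, arg2 in relations:
        s1, s2 = similarity(arg1, question), similarity(arg2, question)
        label1 = "QSEG" if s1 > threshold else "OTHER"
        label2 = "QSEG" if s2 > threshold else "OTHER"
        feats[f"{rel_type}_{label1}_{label2}"].append((s1 + s2) / 2.0)
    # average the values of repeated feature labels within one answer candidate
    return {name: sum(vals) / len(vals) for name, vals in feats.items()}
```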


discourse structure

Appears in 4 sentences as: discourse structure (4)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. We experimentally demonstrate that the discourse structure of non-factoid answers provides information that is complementary to lexical semantic similarity between question and answer, improving performance up to 24% (relative) over a state-of-the-art model that exploits lexical semantic similarity alone.
    Page 1, “Abstract”
  2. Driven by this observation, our main hypothesis is that the discourse structure of NF answers provides complementary information to state-of-the-art QA models that measure the similarity (either lexical and/or semantic) between question and answer.
    Page 1, “Introduction”
  3. Argument labels indicate only if lemmas from the question were found in a discourse structure present in an answer candidate, and do not speak to the specific lemmas that were found.
    Page 3, “Models and Features”
  4. Second, these features model the intensity of the match between the text surrounding the discourse structure and the question text using both the assigned argument labels and the feature values.
    Page 4, “Models and Features”


similarity score

Appears in 4 sentences as: similarity score (2) similarity scores (2)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. The candidate answers are scored using a linear interpolation of two cosine similarity scores: one between the entire parent document and the question (to model global context), and a second between the answer candidate and the question (for local context). Because the number of answer candidates is typically large (e.g., equal to the number of paragraphs in the textbook), we return the N top candidates with the highest scores.
    Page 3, “Approach”
  2. If text before or after a marker out to a given sentence range matches the entire text of the question (with a cosine similarity score larger than a threshold), that argument takes on the label QSEG; otherwise it is labeled OTHER.
    Page 3, “Models and Features”
  3. The values of the discourse features are the mean of the similarity scores (e.g., cosine similarity using tf-idf weighting) between the two marker arguments and the corresponding question (see the sketch after this list).
    Page 3, “Models and Features”
  4. Both this overall similarity score and the average pairwise cosine similarity between each word in the question and each word in the answer candidate serve as features.
    Page 5, “Models and Features”
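
Sentences 2 and 3 above describe how a discourse marker model (DMM) feature is built: the text before and after a marker, out to a given sentence range, is labeled QSEG if its cosine similarity with the question exceeds a threshold, and the feature value is the mean of the two similarities. A minimal sketch follows, assuming the answer is pre-split into sentences and that a `similarity(text_a, text_b)` function (e.g., tf-idf cosine) is available; all names and the threshold are illustrative, not the authors' code.

```python
# Hedged sketch of one DMM-style feature for a single discourse marker occurrence.
# sentences: answer sentences; (sent_idx, tok_idx): marker position; srange: how
# many sentences before/after to include (SR0, SR1, ...). Names are illustrative.
def dmm_feature(question, sentences, sent_idx, tok_idx, marker,
                srange, similarity, threshold=0.3):
    toks = sentences[sent_idx].split()
    before = " ".join(sentences[max(0, sent_idx - srange):sent_idx]
                      + [" ".join(toks[:tok_idx])])
    after = " ".join([" ".join(toks[tok_idx + 1:])]
                     + sentences[sent_idx + 1:sent_idx + 1 + srange])

    s_before = similarity(before, question)
    s_after = similarity(after, question)

    def label(s):
        return "QSEG" if s > threshold else "OTHER"

    name = f"{label(s_before)}_{marker.upper()}_{label(s_after)}_SR{srange}"
    value = (s_before + s_after) / 2.0   # mean similarity of the two arguments
    return name, value
```

For the QSEG BY QSEG SR1 example cited above, `marker` would be "by" and `srange` would be 1.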


statistical significance

Appears in 4 sentences as: statistical significance (2) statistically significant (2)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. Our results show statistically significant improvements of up to 24% on top of state-of-the-art LS models (Yih et al., 2013).
    Page 1, “Introduction”
  2. The transferred models always outperform the baselines, but only the ensemble model's improvement is statistically significant.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  3. The results of the transferred models that include LS features are slightly lower, but still approach statistical significance for P@1 and are significant for MRR.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  4. † indicates approaching statistical significance with p = 0.07 or 0.06 (a generic resampling sketch follows this list).
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
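
The sentences above report significance outcomes, but this page does not state which test was used. The sketch below shows a generic paired bootstrap over per-question metric values (e.g., P@1), a common choice for comparing QA rerankers; it is offered as a stand-in, not as the authors' procedure.

```python
# Hedged sketch of a generic paired bootstrap over per-question metric values
# (e.g., P@1 of system A vs. system B). Not necessarily the authors' test.
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10000, seed=0):
    """One-tailed p-value estimate for the hypothesis that A outperforms B."""
    rng = random.Random(seed)
    n = len(scores_a)
    failures = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff <= 0:        # resample does not favor system A
            failures += 1
    return failures / n_resamples
```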


SVM

Appears in 4 sentences as: SVM (5)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. For all experiments we used a linear SVM kernel.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  2. Table 4 shows that of the highest-weighted SVM features learned when training models for HOW questions on YA and Bio, many are shared (e.g., 56.5% of the features in the top half of both DPMs are shared), suggesting that a core set of discourse features may be of utility across domains.
    Page 7, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  3. Table 4: Percentage of top features with the highest SVM weights that are shared between Bio HOW and YA models.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  4. These experiments were performed in several groups: both with and without LS features, as well as using either a single SVM or an ensemble model that linearly interpolates the predictions of two SVM classifiers (one each for DMM and DPM); a minimal interpolation sketch follows this list. The results are summarized in Table 5.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
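
Sentence 4 above mentions an ensemble that linearly interpolates the predictions of two SVM classifiers, one trained on DMM features and one on DPM features. A minimal sketch of that interpolation, assuming each reranker exposes a real-valued score per candidate; the weight `alpha` is a hypothetical tuned hyperparameter, not a value from the paper.

```python
# Hedged sketch: linearly interpolate the candidate scores of two rerankers
# (e.g., one trained on DMM features, one on DPM features). `alpha` is a
# hypothetical tuned weight.
import numpy as np

def ensemble_rerank(scores_dmm, scores_dpm, alpha=0.5):
    """scores_*: per-candidate scores for one question from the two models."""
    combined = alpha * np.asarray(scores_dmm) + (1.0 - alpha) * np.asarray(scores_dpm)
    return np.argsort(-combined)   # candidate indices, best first
```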


EDUs

Appears in 3 sentences as: EDUs (3)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. Note that our marker arguments are akin to EDUs in RST, but, in this shallow representation, they are simply constructed around discourse markers and bound by an arbitrary sentence range.
    Page 3, “Models and Features”
  2. In RST, the text is segmented into a sequence of non-overlapping fragments called elementary discourse units (EDUs), and binary discourse relations recursively connect neighboring units (a minimal tree sketch follows this list).
    Page 4, “Models and Features”
  3. To tease apart the relative contribution of discourse features that occur only within a single sentence versus features that span multiple sentences, we examined the performance of the full model when using only intra-sentence features, i.e., SR0 features for DMM, and features based on discourse relations where both EDUs appear in the same sentence for DPM, versus the full inter-sentence models.
    Page 7, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
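
Sentence 2 above summarizes the RST representation: the text is segmented into EDUs, and binary relations recursively connect neighboring spans. The sketch below shows one hypothetical way to represent such a tree and to flatten it into (relation, left_text, right_text) tuples, e.g., as input to a DPM-style feature extractor like the one sketched earlier; the class names are illustrative and do not correspond to any real parser's API.

```python
# Hedged sketch: a minimal recursive representation of an RST-style tree and a
# flattening step that yields one (relation, left_text, right_text) tuple per
# internal node. Class names are illustrative, not a real parser's API.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class EDU:                        # leaf: an elementary discourse unit
    text: str

@dataclass
class Relation:                   # internal node: a binary discourse relation
    label: str                    # e.g., "Elaboration", "Cause"
    left: Union["EDU", "Relation"]
    right: Union["EDU", "Relation"]

def span_text(node: Union[EDU, Relation]) -> str:
    if isinstance(node, EDU):
        return node.text
    return span_text(node.left) + " " + span_text(node.right)

def flatten(node: Union[EDU, Relation]) -> List[Tuple[str, str, str]]:
    """Collect (relation_label, left_span_text, right_span_text) for every relation."""
    if isinstance(node, EDU):
        return []
    here = [(node.label, span_text(node.left), span_text(node.right))]
    return here + flatten(node.left) + flatten(node.right)
```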


F1 score

Appears in 3 sentences as: F1 score (3)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. We adapted them to this dataset by weighing each answer by its overlap with gold answers, where overlap is measured as the highest F1 score between the candidate and a gold answer.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  2. Thus, P@1 reduces to this F1 score for the top answer.
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  3. For example, if the best answer for a question appears at rank 2 with an F1 score of 0.3, the corresponding MRR score is 0.3 / 2 = 0.15 (see the sketch after this list).
    Page 6, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
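
Sentences 1 to 3 above define the overlap-weighted metrics: each answer is weighted by its best F1 overlap with a gold answer, so P@1 is the F1 of the top-ranked answer, and the MRR contribution divides the best answer's F1 by its rank (0.3 / 2 = 0.15 in the example). A hedged sketch of that computation follows; the token-level `f1_overlap` helper is an assumption about how overlap is measured, not taken from the paper.

```python
# Hedged sketch of the overlap-weighted metrics described above. `f1_overlap`
# is an assumed token-level F1 between two answers, not the paper's exact measure.
def f1_overlap(candidate_tokens, gold_tokens):
    common = len(set(candidate_tokens) & set(gold_tokens))
    if common == 0:
        return 0.0
    p = common / len(set(candidate_tokens))
    r = common / len(set(gold_tokens))
    return 2 * p * r / (p + r)

def weighted_p_at_1(ranked, golds):
    """P@1 for one question: best F1 of the top-ranked answer against any gold answer."""
    return max(f1_overlap(ranked[0], g) for g in golds)

def weighted_mrr(ranked, golds):
    """MRR for one question: F1 of the best-matching answer divided by its rank."""
    f1s = [max(f1_overlap(a, g) for g in golds) for a in ranked]
    best = max(range(len(ranked)), key=lambda i: f1s[i])
    return f1s[best] / (best + 1)   # e.g., 0.3 / 2 = 0.15 for rank 2, F1 = 0.3
```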


in-domain

Appears in 3 sentences as: in-domain (4)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. Inspired by this previous work and recent work in discourse parsing (Feng and Hirst, 2012), our work is the first to systematically explore structured discourse features driven by several discourse representations, combine discourse with lexical semantic models, and evaluate these representations on thousands of questions using both in-domain and cross-domain experiments.
    Page 2, “Related Work”
  2. The ensemble model without LS (third line) has a P@1 score nearly identical to that of the equivalent in-domain model (line 13 in Table 1), while slightly surpassing in-domain MRR performance.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  3. The in-domain performance of the ensemble model is similar to that of the single classifier in both YA and Bio HOW so we omit these results here for simplicity.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”


language model

Appears in 3 sentences as: language model (2) language models (1)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. (2013) recently addressed the problem of answer sentence selection and demonstrated that LS models, including recurrent neural network language models (RNNLM), have a higher contribution to overall performance than exploiting syntactic analysis.
    Page 2, “Related Work”
  2. In particular, we use the recurrent neural network language model (RNNLM) of Mikolov et al.
    Page 4, “Models and Features”
  3. Like any language model, an RNNLM estimates the probability of observing a word given the preceding context, but, in this process, it learns word embeddings into a latent, conceptual space with a fixed number of dimensions (an embedding-feature sketch follows this list).
    Page 4, “Models and Features”
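
Sentence 3 above notes that an RNNLM learns word embeddings in a latent space; elsewhere this page describes the lexical-semantic features as the cosine similarity of composite question and answer vectors plus the average pairwise word-to-word similarity. The sketch below assumes pre-trained embeddings stored in a dict and composes them by summation; the composition choice and all names are assumptions, not details from the paper.

```python
# Hedged sketch of embedding-based LS features: an overall composite-vector
# similarity plus the average pairwise word-to-word similarity. Summation as the
# composition function is an assumption; `emb` maps words to numpy vectors.
import numpy as np

def cos(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n > 0 else 0.0

def ls_features(q_tokens, a_tokens, emb):
    qv = [emb[w] for w in q_tokens if w in emb]
    av = [emb[w] for w in a_tokens if w in emb]
    if not qv or not av:
        return {"ls_overall": 0.0, "ls_avg_pairwise": 0.0}

    overall = cos(np.sum(qv, axis=0), np.sum(av, axis=0))   # composite vectors
    pairwise = float(np.mean([cos(u, v) for u in qv for v in av]))
    return {"ls_overall": overall, "ls_avg_pairwise": pairwise}
```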


semantic similarity

Appears in 3 sentences as: semantic similarity (4)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. We experimentally demonstrate that the discourse structure of non-factoid answers provides information that is complementary to lexical semantic similarity between question and answer, improving performance up to 24% (relative) over a state-of-the-art model that exploits lexical semantic similarity alone.
    Page 1, “Abstract”
  2. This way, the DMM and DPM features jointly capture discourse structures and semantic similarity between answer segments and question.
    Page 8, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”
  3. Empirically we show that modeling answer discourse structures is complementary to modeling lexical semantic similarity and that the best performance is obtained when they are tightly integrated.
    Page 9, “CR + LS + DMM + DPM 39.32* +24% 47.86* +20%”


Treebank

Appears in 3 sentences as: Treebank (3)
In Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
  1. RST Treebank
    Page 2, “Related Work”
  2. performance on a small sample of seven WSJ articles drawn from the RST Treebank (Carlson et al., 2003).
    Page 2, “Related Work”
  3. Note that, because these domains are considerably different from the RST Treebank , the parser fails to produce a tree on a large number of answer candidates: 6.2% for YA, and 41.1% for Bio.
    Page 5, “Experiments”
