DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
Owczarzak, Karolina

Article Structure

Abstract

This paper presents DEPEVAL(summ), a dependency-based metric for automatic evaluation of summaries.

Introduction

Evaluation is a crucial component in the area of automatic summarization; it is used both to rank multiple participant systems in shared summarization tasks, such as the Summarization track at Text Analysis Conference (TAC) 2008 and its Document Understanding Conference (DUC) predecessors, and to provide feedback to developers whose goal is to improve their summarization systems.

Current practice in summary evaluation

In the first Text Analysis Conference (TAC 2008), as well as its predecessor, the Document Understanding Conference (DUC) series, the evaluation ...

Lexical-Functional Grammar and the LFG parser

The method discussed in this paper rests on the assumptions of Lexical-Functional Grammar (LFG) (Kaplan and Bresnan, 1982; Bresnan, 2001).

Dependency-based evaluation

Our dependency-based evaluation method, similarly to BE, compares two unordered sets of dependencies: one bag contains dependencies harvested from the candidate summary and the other contains dependencies from one or more reference summaries.
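
To make the comparison concrete, the following is a minimal sketch of bag-of-dependencies scoring, not the authors' implementation: dependencies are assumed to be (relation, head, dependent) triples, and overlap is reported as precision, recall, and an equally weighted f-measure.

    from collections import Counter

    def depeval_score(candidate_deps, reference_deps):
        """Score one candidate summary against reference dependencies.

        Both arguments are iterables of (relation, head, dependent) triples,
        e.g. ("subj", "announce", "company"); the triple format is assumed
        here for illustration.
        """
        cand = Counter(candidate_deps)          # bag from the candidate summary
        ref = Counter(reference_deps)           # bag from the reference summaries
        matched = sum((cand & ref).values())    # bag (multiset) intersection
        precision = matched / sum(cand.values()) if cand else 0.0
        recall = matched / sum(ref.values()) if ref else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1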

Experimental results

The first question we have to ask is: which of the manual evaluation categories do we want our metric to imitate?

Discussion and future work

It is obvious that none of the versions performs best across the board; their different characteristics might render them better suited either for models or for automatic systems, but not for both at the same time.

Topics

evaluation metrics

Appears in 9 sentences as: evaluation metric (4) evaluation metrics (5)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
    Page 1, “Abstract”
  2. In this paper, we explore one such evaluation metric, DEPEVAL(summ), based on the comparison of Lexical-Functional Grammar (LFG) dependencies between a candidate summary and ...
    Page 1, “Introduction”
  3. Since this type of evaluation processes information in stages (constituent parser, dependency extraction, and the method of dependency matching between a candidate and a reference), there is potential for variance in performance among dependency-based evaluation metrics that use different components.
    Page 3, “Current practice in summary evaluation”
  4. In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
    Page 4, “Dependency-based evaluation”
  5. Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
    Page 5, “Experimental results”
  6. Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data.
    Page 6, “Experimental results”
  7. Admittedly, we could just ignore this problem and focus on increasing correlations for automatic summaries only; after all, the whole point of creating evaluation metrics is to score and rank the output of systems.
    Page 7, “Discussion and future work”
  8. Since there is no single winner among all 32 variants of DEPEVAL(summ) on TAC 2008 data, we must decide which of the categories is most important to a successful automatic evaluation metric.
    Page 8, “Discussion and future work”
  9. This ties in with the purpose which the evaluation metric should serve.
    Page 8, “Discussion and future work”
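
Item 6 above refers to system-level Pearson correlation, i.e. correlating each system's average automatic score with its average manual score across all systems. A minimal sketch of that computation (assuming the per-system averages are already available) could be:

    from math import sqrt
    from statistics import mean

    def pearson(xs, ys):
        """Pearson's r between paired per-system scores, e.g. average
        DEPEVAL(summ) f-measures vs. average manual scores."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)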

human judgments

Appears in 9 sentences as: human judges (1) human judgment (3) human judgments (5)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. In a test on TAC 2008 and DUC 2007 data, DEPEVAL(summ) achieves comparable or higher correlations with human judgments than the popular evaluation metrics ROUGE and Basic Elements (BE).
    Page 1, “Abstract”
  2. Despite relying on the same concept, our approach outperforms BE in most comparisons, and it often achieves higher correlations with human judgments than the string-matching metric ROUGE (Lin, 2004).
    Page 1, “Introduction”
  3. Manual assessment, performed by human judges, usually centers around two main aspects of summary quality: content and form.
    Page 2, “Current practice in summary evaluation”
  4. In fact, when it comes to evaluation of automatic summaries, BE shows higher correlations with human judgments than ROUGE, although the difference is not large enough to be statistically significant.
    Page 3, “Current practice in summary evaluation”
  5. In Owczarzak (2008), the method achieves equal or higher correlations with human judgments than METEOR (Banerjee and Lavie, 2005), one of the best-performing automatic MT evaluation metrics.
    Page 4, “Dependency-based evaluation”
  6. In summary evaluation, as will be shown in Section 5, it leads to higher correlations with human judgments only in the case of human-produced model summaries, because almost any variation between two model summaries is “legal”, i.e. ...
    Page 4, “Dependency-based evaluation”
  7. For automatic summaries, which are of relatively poor quality, partial matching lowers our method’s ability to reflect human judgment, because it results in overly generous matching in situations where the examined information is neither a paraphrase nor relevant.
    Page 4, “Dependency-based evaluation”
  8. Of course, the ideal evaluation metric would show high correlations with human judgment on both levels.
    Page 5, “Experimental results”
  9. The letters in parentheses indicate that a given DEPEVAL(summ) variant is significantly better at correlating with human judgment than ROUGE-2 (= R2), ROUGE-SU4 (= R4), or BE-HM (= B).
    Page 6, “Experimental results”
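
Items 6 and 7 above contrast full and partial dependency matching. The exact partial-matching procedure is not reproduced here; one simplified reading, assumed purely for illustration, is to split each (relation, head, dependent) triple into a (relation, head) half and a (relation, dependent) half and to credit overlap at that coarser level, which makes the matching more permissive, as the excerpts describe:

    from collections import Counter

    def partial_recall(candidate_deps, reference_deps):
        """Recall over 'half' dependencies; an assumed simplification of
        partial matching, not the paper's exact procedure."""
        def halves(deps):
            bag = Counter()
            for rel, head, dep in deps:
                bag[(rel, "head", head)] += 1
                bag[(rel, "dep", dep)] += 1
            return bag
        cand, ref = halves(candidate_deps), halves(reference_deps)
        matched = sum((cand & ref).values())
        return matched / sum(ref.values()) if ref else 0.0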

manual evaluation

Appears in 8 sentences as: Manual evaluation (1) manual evaluation (7)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. However, manual evaluation of a large number of documents necessary for a relatively unbiased view is often unfeasible, especially in the contexts where repeated evaluations are needed.
    Page 1, “Introduction”
  2. A more detailed description of BE and ROUGE is presented in Section 2, which also gives an account of manual evaluation methods employed at TAC 2008.
    Page 1, “Introduction”
  3. Since manual evaluation is still the undisputed gold standard, both at TAC and DUC there was much effort to evaluate manually as much data as possible.
    Page 2, “Current practice in summary evaluation”
  4. 2.1 Manual evaluation
    Page 2, “Current practice in summary evaluation”
  5. Automatic metrics, because of their relative speed, can be applied more widely than manual evaluation.
    Page 2, “Current practice in summary evaluation”
  6. The first question we have to ask is: which of the manual evaluation categories do we want our metric to imitate?
    Page 5, “Experimental results”
  7. The Pyramid is, at the same time, a costly manual evaluation method, so an automatic metric that successfully emulates it would be a useful replacement.
    Page 5, “Experimental results”
  8. Table 1: System-level Pearson’s correlation between automatic and manual evaluation metrics for TAC 2008 data.
    Page 6, “Experimental results”

WordNet

Appears in 6 sentences as: WordNet (6)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. We examine a number of variations of the method, including the addition of WordNet, partial matching, or removing relation labels from the dependencies.
    Page 1, “Abstract”
  2. This makes sense if we take into consideration that WordNet lists all possible synonyms for all possible senses of a word, and so, given a great number of cross-sentence comparisons in multi-sentence summaries, there is an increased risk of spurious matches between words which, despite being potentially synonymous in certain contexts, are not equivalent in the text.
    Page 4, “Dependency-based evaluation”
  3. The correlations are listed for the following versions of our method: pm - partial matching for dependencies; wn - WordNet; pred - matching predicate-only dependencies; norel - ignoring dependency relation label; one - counting a match only once irrespective of how many instances of ...
    Page 5, “Experimental results”
  4. Three such variants are the baseline DEPEVAL(summ), the WordNet version DEPEVAL(summ) wn, and the version with removed relation labels DEPEVAL(summ) norel.
    Page 8, “Discussion and future work”
  5. The new implementation of BE presented at the TAC 2008 workshop (Tratz and Hovy, 2008) introduces transformations for dependencies in order to increase the number of matches among elements that are semantically similar yet differ in terms of syntactic structure and/or lexical choices, and adds WordNet for synonym matching.
    Page 8, “Discussion and future work”
  6. Since our method, presented in this paper, also uses the reranking parser, as well as WordNet, it would be interesting to compare both methods directly in terms of the performance of the dependency extraction procedure.
    Page 8, “Discussion and future work”
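
The WordNet variant relaxes exact lexical identity to synonymy when comparing the words inside a dependency. A minimal sketch using NLTK's WordNet interface (the choice of NLTK is an assumption; the paper does not name a toolkit) would treat two words as matching if they share at least one synset, which also illustrates the risk of spurious matches noted in item 2:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def words_match(w1, w2):
        """True if the words are identical or share any WordNet synset.
        Checking every synset of every sense is permissive, so words that
        are synonymous only in some contexts can still be counted as equal."""
        if w1 == w2:
            return True
        return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))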

reranking

Appears in 5 sentences as: reranking (5)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. Using a reranking parser and a Lexical-Functional Grammar (LFG) annotation, we produce a set of dependency triples for each summary.
    Page 1, “Abstract”
  2. ... (2004) applied to the output of the reranking parser of Charniak and Johnson (2005), whereas in BE (in the version presented here) dependencies are generated by the Minipar parser (Lin, 1995).
    Page 1, “Introduction”
  3. First, a summary is parsed with the Charniak-Johnson reranking parser (Charniak and Johnson, 2005) to obtain the phrase-structure tree.
    Page 3, “Lexical-Functional Grammar and the LFG parser”
  4. Its core modules were updated as well: Minipar was replaced with the Charniak-Johnson reranking parser (Charniak and Johnson, 2005), Named Entity identification was added, and the BE extraction is conducted using a set of Tregex rules (Levy and Andrew, 2006).
    Page 8, “Discussion and future work”
  5. Since our method, presented in this paper, also uses the reranking parser, as well as WordNet, it would be interesting to compare both methods directly in terms of the performance of the dependency extraction procedure.
    Page 8, “Discussion and future work”
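
The excerpts above describe a two-stage pipeline: phrase-structure parsing with the Charniak-Johnson reranking parser, followed by LFG annotation to obtain dependency triples. A rough sketch of the first stage using the bllipparser package (the publicly distributed version of that parser) is given below; the LFG annotation algorithm of Cahill et al. (2004) is not available as a standard library, so the extraction step is only indicated by a comment:

    from bllipparser import RerankingParser

    # Download and load a standard reranking-parser model shipped with the
    # bllipparser distribution.
    rrp = RerankingParser.fetch_and_load('WSJ-PTB3', verbose=True)

    def parse_summary(sentences):
        """Return a Penn-Treebank-style phrase-structure tree per sentence."""
        return [rrp.simple_parse(s) for s in sentences]

    # The LFG f-structure annotation would then be run over these trees to
    # produce the dependency triples compared by DEPEVAL(summ).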

precision and recall

Appears in 3 sentences as: precision and recall (3)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. ... (2004) obtains high precision and recall rates.
    Page 3, “Lexical-Functional Grammar and the LFG parser”
  2. Overlap between the candidate bag and the reference bag is calculated in the form of precision, recall, and the f-measure (with precision and recall equally weighted).
    Page 4, “Dependency-based evaluation”
  3. Much like the inverse relation of precision and recall, changes and additions that improve a metric’s correlation with human scores for model summaries often weaken the correlation for system summaries, and vice versa.
    Page 7, “Discussion and future work”

word order

Appears in 3 sentences as: word order (3)
In DEPEVAL(summ): Dependency-based Evaluation for Automatic Summaries
  1. ... rely solely on the surface sequence of words to determine similarity between summaries, but delves into what could be called a shallow semantic structure, comprising thematic roles such as subject and object, it is likely to notice identity of meaning where such identity is obscured by variations in word order.
    Page 3, “Current practice in summary evaluation”
  2. C-structure represents the word order of the surface string and the hierarchical organisation of phrases in terms of trees.
    Page 3, “Lexical-Functional Grammar and the LFG parser”
  3. As a result, then, a lot of effort was put into developing metrics that can identify similar content despite non-similar form, which naturally led to the application of linguistically-oriented approaches that look beyond surface word order.
    Page 7, “Discussion and future work”
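
As a small illustration of item 1, two sentences that differ only in word order, such as "Yesterday the company announced a merger" and "The company announced a merger yesterday", yield the same set of dependencies, so a dependency-based comparison treats them as identical; the triples below are a simplified, assumed representation rather than actual LFG parser output:

    # Simplified dependency triples for two word-order variants.
    sent_a = {("subj", "announce", "company"),
              ("obj", "announce", "merger"),
              ("adjunct", "announce", "yesterday")}
    sent_b = {("subj", "announce", "company"),
              ("obj", "announce", "merger"),
              ("adjunct", "announce", "yesterday")}

    assert sent_a == sent_b  # same content despite different surface order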
