Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
Pitler, Emily and Louis, Annie and Nenkova, Ani

Article Structure

Abstract

To date, few attempts have been made to develop and validate methods for automatic evaluation of linguistic quality in text summarization.

Introduction

Efforts for the development of automatic text summarizers have focused almost exclusively on improving content selection capabilities of systems, ignoring the linguistic quality of the system output.

Aspects of linguistic quality

We focus on the five aspects of linguistic quality that were used to evaluate summaries in DUC: grammaticality, non-redundancy, referential clarity, focus, and structure/coherence. For each of these questions, all summaries were manually rated on a scale from 1 to 5, where 5 is the best.

Indicators of linguistic quality

Multiple factors influence the linguistic quality of text in general, including word choice, the reference form of entities, and local coherence.

Summarization data

For our experiments, we use data from the multi-document summarization tasks of the Document Understanding Conference (DUC) workshops (Over et al., 2007).

Experimental setup

We use the summaries from DUC 2006 for training and feature development, and DUC 2007 serves as the test set.

Results and discussion

6.1 System-level evaluation

Conclusion

We have presented an analysis of a wide variety of features for the linguistic quality of summaries.

Topics

coreference

Appears in 12 sentences as: Coref (1) Coreference (1) coreference (10)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features.
    Page 1, “Abstract”
  2. This class of linguistic quality indicators is a combination of factors related to coreference, adjacent sentence similarity, and summary-specific context of surface cohesive devices.
    Page 3, “Indicators of linguistic quality”
  3. Coreference: Steinberger et al. (2007) compare the coreference chains in input documents and in summaries in order to locate potential problems.
    Page 4, “Indicators of linguistic quality”
  4. We instead define a set of more general features related to coreference that are not specific to summarization and are applicable for any text.
    Page 4, “Indicators of linguistic quality”
  5. Automatic coreference systems are trained on human-produced texts and we expect their accuracies to drop when applied to automatically generated summaries.
    Page 4, “Indicators of linguistic quality”
  6. The tool does not perform full coreference resolution.
    Page 5, “Indicators of linguistic quality”
  7. For all four other questions, the best feature set is Continuity, which is a combination of summarization specific features, coreference features and cosine similarity of adjacent sentences.
    Page 7, “Results and discussion”
  8. We now investigate to what extent each of its components—summary-specific features, coreference, and cosine similarity between adjacent sentences—contribute to performance.
    Page 8, “Results and discussion”
  9. However, the coreference features do not seem to contribute much towards predicting summary linguistic quality.
    Page 8, “Results and discussion”
  10. The accuracies of the Continuity class are not affected at all when these coreference features are not included.
    Page 8, “Results and discussion”


language models

Appears in 12 sentences as: Language model (1) language model (3) Language Modeling (1) Language models (1) language models (6)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. 3.1 Word choice: language models
    Page 2, “Indicators of linguistic quality”
  2. Language models (LM) are a way of computing how familiar a text is to readers using the distribution of words from a large background corpus.
    Page 2, “Indicators of linguistic quality”
  3. We built unigram, bigram, and trigram language models with Good-Turing smoothing over the New York Times (NYT) section of the English Gigaword corpus (over 900 million words).
    Page 2, “Indicators of linguistic quality”
  4. We used the SRI Language Modeling Toolkit (Stolcke, 2002) for this purpose.
    Page 2, “Indicators of linguistic quality”
  5. For each of the three n-gram language models, we include the min, max, and average log probability of the sentences contained in a summary, as well as the overall log probability of the entire summary.
    Page 2, “Indicators of linguistic quality”
  6. We expect that these structural features will be better at detecting ungrammatical sentences than the local language model features.
    Page 4, “Indicators of linguistic quality”
  7. Word coherence can be considered as the analog of language models at the inter-sentence level.
    Page 5, “Indicators of linguistic quality”
  8. Coh-Metrix, which has been proposed as a comprehensive characterization of text, does not perform as well as the language model and the entity coherence classes, which contain considerably fewer features related to only one aspect of text.
    Page 7, “Results and discussion”
  9. It is apparent from the results that continuity, entity coherence, sentence fluency and language models are the most powerful classes of features that should be used in automation of evaluation and against which novel predictors of text quality should be compared.
    Page 7, “Results and discussion”
  10. For example, the language model features, which are the second best class for the system-level, do not fare as well at the input-level.
    Page 8, “Results and discussion”
  11. While for the machines Continuity feature class is the best predictor of referential clarity, focus, and structure (Table 3), for humans, language models and sentence fluency are best for
    Page 9, “Results and discussion”
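
Item 5 above lists the language model features: the min, max, and average per-sentence log probability and the overall log probability of the summary. A minimal sketch of that aggregation follows, assuming a toy add-one-smoothed unigram model in place of the paper's SRILM-built Good-Turing n-gram models; background_counts is a hypothetical word-count table from a large background corpus.

# Sketch of the summary-level LM features described in item 5 above:
# min, max, and average per-sentence log probability, plus the overall
# log probability of the summary. A toy add-one-smoothed unigram model
# stands in for the paper's Good-Turing n-gram models built with SRILM;
# background_counts is a hypothetical word-count dictionary drawn from a
# large background corpus such as the NYT section of Gigaword.
import math
from collections import Counter

def sentence_logprob(tokens, background_counts, vocab_size, total_tokens):
    """Log10 probability of a token sequence under a smoothed unigram model."""
    logp = 0.0
    for tok in tokens:
        count = background_counts.get(tok.lower(), 0)
        logp += math.log10((count + 1) / (total_tokens + vocab_size))
    return logp

def lm_features(summary_sentences, background_counts):
    """summary_sentences: list of token lists, one per summary sentence."""
    vocab_size = len(background_counts)
    total_tokens = sum(background_counts.values())
    per_sentence = [
        sentence_logprob(sent, background_counts, vocab_size, total_tokens)
        for sent in summary_sentences
    ]
    return {
        "min_sent_logprob": min(per_sentence),
        "max_sent_logprob": max(per_sentence),
        "avg_sent_logprob": sum(per_sentence) / len(per_sentence),
        "summary_logprob": sum(per_sentence),
    }

# Example usage with a tiny made-up background corpus.
background_counts = Counter("the court ruled on the appeal on monday".split())
summary = [["The", "court", "ruled", "."], ["The", "appeal", "was", "denied", "."]]
print(lm_features(summary, background_counts))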


named entities

Appears in 10 sentences as: Named ent (3) Named entities (1) named entities (4) Named Entity (1) named entity (1)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. 3.2 Reference form: Named entities
    Page 2, “Indicators of linguistic quality”
  2. This set of features examines whether named entities have informative descriptions in the summary.
    Page 2, “Indicators of linguistic quality”
  3. We focus on named entities because they appear often in summaries of news documents and are often not known to the reader beforehand.
    Page 2, “Indicators of linguistic quality”
  4. We run the Stanford Named Entity Recognizer (Finkel et al., 2005) and record the number of PERSONS, ORGANIZATIONS, and LOCATIONS.
    Page 2, “Indicators of linguistic quality”
  5. For each type of named entity (PERSON, ORGANIZATION, LOCATION), we separately record the number of instances which appear as first mentions in the summary but correspond to non-first mentions in the source documents.
    Page 3, “Indicators of linguistic quality”
  6. Some summaries might not include people and other named entities at all.
    Page 3, “Indicators of linguistic quality”
  7. Named ent.
    Page 7, “Results and discussion”
  8. The classes of features specific to named entities and noun phrase syntax are the weakest predictors.
    Page 7, “Results and discussion”
  9. Named ent.
    Page 8, “Results and discussion”
  10. Named ent.
    Page 9, “Results and discussion”
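
Items 4 and 5 above describe the named entity features: counts of PERSON, ORGANIZATION, and LOCATION mentions, and the number of entities whose first mention in the summary corresponds to a non-first mention in the source documents. The sketch below is only an approximation of that logic: it assumes NER output is already available as (entity string, type) tuples, and the head-word matching of mentions is a simplification introduced here, not the paper's method.

# Sketch of the named-entity features in items 4-5 above, assuming the
# summary and its source documents have already been run through an NER
# tagger (e.g., the Stanford NER) that yields (entity_string, type) tuples
# in document order. The head-word key used to match mentions of the same
# entity is an assumption made for this illustration.
from collections import Counter

NE_TYPES = ("PERSON", "ORGANIZATION", "LOCATION")

def head(entity):
    """Crude key for matching mentions of the same entity: last word."""
    return entity.lower().split()[-1]

def ne_counts(mentions):
    """Number of PERSON / ORGANIZATION / LOCATION mentions."""
    counts = Counter(t for _, t in mentions if t in NE_TYPES)
    return {t: counts.get(t, 0) for t in NE_TYPES}

def summary_first_source_nonfirst(summary_mentions, source_mentions):
    """For each type, count entities whose first mention in the summary uses
    a different string than the entity's first mention in the sources."""
    source_first = {}
    for text, netype in source_mentions:
        source_first.setdefault((netype, head(text)), text.lower())
    counts, seen = Counter(), set()
    for text, netype in summary_mentions:
        key = (netype, head(text))
        if netype not in NE_TYPES or key in seen:
            continue
        seen.add(key)
        if key in source_first and source_first[key] != text.lower():
            counts[netype] += 1
    return {t: counts.get(t, 0) for t in NE_TYPES}

# Example with made-up NER output.
summary_ner = [("Clinton", "PERSON"), ("Microsoft", "ORGANIZATION")]
source_ner = [("President Bill Clinton", "PERSON"), ("Clinton", "PERSON"),
              ("Microsoft", "ORGANIZATION"), ("Seattle", "LOCATION")]
print(ne_counts(summary_ner))
print(summary_first_source_nonfirst(summary_ner, source_ner))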


cosine similarity

Appears in 6 sentences as: Cosine similarity (2) cosine similarity (5)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. Focus, coherence and referential clarity are best evaluated by a class of features measuring local coherence on the basis of cosine similarity between sentences, coreference information, and summarization specific features.
    Page 1, “Abstract”
  2. Cosine similarity: We use cosine similarity to compute the overlap of words in adjacent sentences s_i and s_{i+1} as a measure of continuity.
    Page 4, “Indicators of linguistic quality”
  3. We compute the min, max, and average value of cosine similarity over the entire summary.
    Page 4, “Indicators of linguistic quality”
  4. Cosine similarity is thus indicative of both continuity and redundancy.
    Page 4, “Indicators of linguistic quality”
  5. For all four other questions, the best feature set is Continuity, which is a combination of summarization specific features, coreference features and cosine similarity of adjacent sentences.
    Page 7, “Results and discussion”
  6. We now investigate to what extent each of its components—summary-specific features, coreference, and cosine similarity between adjacent sentences—contribute to performance.
    Page 8, “Results and discussion”
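
Items 2 and 3 above define the continuity features based on cosine similarity of adjacent sentences. Below is a minimal sketch over bag-of-words vectors; any stemming or stopword handling the paper may apply is omitted.

# Sketch of the adjacent-sentence cosine similarity features from items 2-3
# above: cosine over bag-of-words vectors for each pair (s_i, s_{i+1}),
# then min / max / average across the summary. Preprocessing such as
# stemming or stopword removal is not shown.
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def continuity_features(sentences):
    """sentences: list of token lists in summary order (needs >= 2 sentences)."""
    sims = [cosine(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]
    return {"min_cos": min(sims), "max_cos": max(sims), "avg_cos": sum(sims) / len(sims)}

summary = [["the", "court", "ruled", "on", "the", "appeal"],
           ["the", "appeal", "was", "denied"],
           ["protests", "followed", "the", "decision"]]
print(continuity_features(summary))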


NIST

Appears in 6 sentences as: NIST (6)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. We train and test linguistic quality models on consecutive years of NIST evaluation data in order to show the generality of results.
    Page 1, “Abstract”
  2. quality and none have been validated on data from NIST evaluations.
    Page 1, “Introduction”
  3. We evaluate the predictive power of these linguistic quality metrics by training and testing models on consecutive years of NIST evaluations (data described
    Page 1, “Introduction”
  4. In both DUC 2006 and DUC 2007, ten NIST assessors wrote summaries for the various inputs.
    Page 9, “Results and discussion”
  5. We only report results on the input level, as we are interested in distinguishing between the quality of the summaries, not the NIST assessors’ writing skills.
    Page 9, “Results and discussion”
  6. Automatic evaluation will make testing easier during system development and enable reporting results obtained outside of the cycles of NIST evaluation.
    Page 9, “Conclusion”


Feature set

Appears in 5 sentences as: Feature set (3) feature set (1) feature sets (1)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. For all four other questions, the best feature set is Continuity, which is a combination of summarization specific features, coreference features and cosine similarity of adjacent sentences.
    Page 7, “Results and discussion”
  2. Feature set Gram.
    Page 7, “Results and discussion”
  3. Feature set Gram.
    Page 8, “Results and discussion”
  4. Note however that the relative performance of the feature sets changes between the machine and human results.
    Page 9, “Results and discussion”
  5. Feature set Gram.
    Page 9, “Results and discussion”


noun phrases

Appears in 5 sentences as: Noun phrases (1) noun phrases (4)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. Referential clarity: It should be easy to identify who or what the pronouns and noun phrases in the summary are referring to.
    Page 2, “Aspects of linguistic quality”
  2. In this class, we include features that reflect the modification properties of noun phrases (NPs) in the summary that are first mentions to people.
    Page 3, “Indicators of linguistic quality”
  3. Noun phrases can include pre-modifiers, appositives, prepositional phrases, etc.
    Page 3, “Indicators of linguistic quality”
  4. These include sentence length, number of fragments, average lengths of the different types of syntactic phrases, total length of modifiers in noun phrases, and various other syntactic features.
    Page 4, “Indicators of linguistic quality”
  5. Instead, noun phrases are considered to refer to the same entity if their heads are identical.
    Page 5, “Indicators of linguistic quality”
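
Item 5 above replaces full coreference resolution with a head-identity heuristic. A minimal sketch of that grouping follows, assuming noun phrases are already extracted as token lists and approximating the head as the final token, which is a simplification of real head-finding rules.

# Sketch of the head-identity heuristic from item 5 above: noun phrases are
# grouped into the same "entity" whenever their heads match. NPs are assumed
# to be already extracted as token lists, and the head is approximated as
# the last token.
from collections import defaultdict

def group_by_head(noun_phrases):
    """noun_phrases: list of token lists, e.g. [["the", "senior", "senator"], ...]."""
    entities = defaultdict(list)
    for np in noun_phrases:
        entities[np[-1].lower()].append(" ".join(np))
    return dict(entities)

nps = [["the", "president"], ["the", "former", "president"],
       ["the", "budget"], ["a", "revised", "budget"]]
for head_word, mentions in group_by_head(nps).items():
    print(head_word, "->", mentions)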


machine translation

Appears in 4 sentences as: machine translated (1) Machine Translation (1) machine translation (2)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. For this reason, LMs are widely used in applications such as generation and machine translation to guide the production of sentences.
    Page 2, “Indicators of linguistic quality”
  2. These features are weakly but significantly correlated with the fluency of machine translated sentences.
    Page 4, “Indicators of linguistic quality”
  3. Soricut and Marcu (2006) make an analogy to machine translation : two words are likely to be translations of each other if they often appear in parallel sentences; in texts, two words are likely to signal local coherence if they often appear in adjacent sentences.
    Page 5, “Indicators of linguistic quality”
  4. For example, at the 2008 ACL Workshop on Statistical Machine Translation , all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).
    Page 8, “Results and discussion”
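
Item 3 above describes the word-coherence analogy of Soricut and Marcu (2006): words that often appear in adjacent sentences signal local coherence, much as words that often appear in parallel sentences are likely translations. The sketch below only gathers the adjacent-sentence co-occurrence counts such a model would be estimated from; it does not reproduce their translation-model training.

# Sketch of the word-coherence analogy in item 3 above: treat each pair of
# adjacent sentences like a "parallel" sentence pair and count how often a
# word in sentence i co-occurs with a word in sentence i+1. The raw counts
# here only illustrate the data a full IBM-style model would be trained on.
from collections import Counter
from itertools import product

def adjacent_word_pairs(sentences):
    """sentences: list of token lists for a text, in order."""
    pairs = Counter()
    for left, right in zip(sentences, sentences[1:]):
        pairs.update(product(set(t.lower() for t in left),
                             set(t.lower() for t in right)))
    return pairs

text = [["the", "storm", "hit", "the", "coast"],
        ["residents", "fled", "the", "storm"],
        ["officials", "assessed", "the", "damage"]]
print(adjacent_word_pairs(text).most_common(5))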


evaluation metrics

Appears in 3 sentences as: evaluation metrics (3)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. Good performance on this task is the most desired property of evaluation metrics during system development.
    Page 1, “Introduction”
  2. These input-level accuracies compare favorably with automatic evaluation metrics for other natural language processing tasks.
    Page 8, “Results and discussion”
  3. For example, at the 2008 ACL Workshop on Statistical Machine Translation, all fifteen automatic evaluation metrics, including variants of BLEU scores, achieved between 42% and 56% pairwise accuracy with human judgments at the sentence level (Callison-Burch et al., 2008).
    Page 8, “Results and discussion”
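
Item 3 above reports pairwise accuracy against human judgments. A minimal sketch of that measure follows; skipping pairs the humans rated equally is an assumption made here, and the cited evaluations may treat ties differently.

# Sketch of pairwise accuracy as referenced in item 3 above: the fraction of
# item pairs whose relative order under the automatic metric matches the
# human ratings. Pairs with tied human ratings are skipped.
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    agree, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        if human_scores[i] == human_scores[j]:
            continue  # skip pairs the humans rated equally
        total += 1
        human_order = human_scores[i] > human_scores[j]
        metric_order = metric_scores[i] > metric_scores[j]
        if human_order == metric_order:
            agree += 1
    return agree / total if total else 0.0

print(pairwise_accuracy([5, 3, 4, 2], [0.9, 0.2, 0.7, 0.4]))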


SVM

Appears in 3 sentences as: SVM (3)
In Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
  1. We use a Ranking SVM (SVMlight (Joachims, 2002)) to score summaries using our features.
    Page 6, “Experimental setup”
  2. The Ranking SVM seeks to minimize the number of discordant pairs (pairs in which the gold standard has x1 ranked strictly higher than x2, but the learner ranks x2 strictly higher than x1).
    Page 6, “Experimental setup”
  3. For system-level evaluation, we treat the real-valued output of the SVM ranker for each summary as the linguistic quality score.
    Page 7, “Experimental setup”
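
Items 1 and 2 above describe the Ranking SVM, which minimizes discordant pairs, and item 3 notes that its real-valued output is used as the quality score. The sketch below illustrates the same idea through the standard pairwise transform, with scikit-learn's LinearSVC standing in for SVMlight; the feature vectors and ratings are made up for illustration.

# Sketch of the Ranking SVM idea from items 1-3 above, via the standard
# pairwise transform: for summaries rated x1 > x2 by humans, train a linear
# SVM to classify the feature difference (x1 - x2) as positive, which
# corresponds to penalizing discordant pairs. scikit-learn's LinearSVC is a
# stand-in for the SVMlight ranking mode used in the paper.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, ratings):
    diffs, labels = [], []
    for i, j in combinations(range(len(ratings)), 2):
        if ratings[i] == ratings[j]:
            continue
        diff = X[i] - X[j]
        sign = 1 if ratings[i] > ratings[j] else -1
        # add each pair in both orientations to keep the two classes balanced
        diffs.extend([sign * diff, -sign * diff])
        labels.extend([1, -1])
    return np.array(diffs), np.array(labels)

# X: one row of linguistic-quality features per summary; ratings: human scores.
X = np.array([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3], [0.2, 0.9]])
ratings = [5, 2, 4, 1]

X_pairs, y_pairs = pairwise_transform(X, ratings)
ranker = LinearSVC(C=1.0).fit(X_pairs, y_pairs)

# As in item 3, the real-valued decision score serves as the quality score.
quality_scores = X @ ranker.coef_.ravel()
print(quality_scores)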
