Summarizing multiple spoken documents: finding evidence from untranscribed audio
Zhu, Xiaodan and Penn, Gerald and Rudzicz, Frank

Article Structure

Abstract

This paper presents a model for summarizing multiple untranscribed spoken documents.

Introduction

Summarizing spoken documents has been extensively studied over the past several years (Penn and Zhu, 2008; Maskey and Hirschberg, 2005; Murray et al., 2005; Christensen et al., 2004; Zechner, 2001).

Related work

2.1 Speech summarization

An acoustics-based approach

The acoustics-based summarization technique proposed in this paper consists of three consecutive components.

Experimental setup

We use the TDT-4 dataset for our evaluation, which consists of annotated news broadcasts grouped into common topics.

Experimental results

We aim to empirically determine the extent to which acoustic information alone can effectively replace conventional speech recognition within the multi-document speech summarization task.

Conclusions and future work

In text summarization, statistics based on word counts have traditionally served as the foundation of state-of-the-art models.
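
As a generic illustration of what "statistics based on word counts" means in practice (background only, not the acoustics-based model proposed in the paper), a frequency-based sentence scorer can be sketched as follows; the tokenization and the stopword list are toy assumptions.

```python
# A minimal, generic word-count-based sentence scorer (background
# illustration only; not the model proposed in the paper). Each sentence
# is scored by the average corpus frequency of its content words.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "was", "were"}  # toy list

def score_sentences(sentences: list[str]) -> list[tuple[float, str]]:
    words = [w.lower() for s in sentences for w in s.split() if w.lower() not in STOPWORDS]
    freq = Counter(words)
    scored = []
    for s in sentences:
        content = [w.lower() for w in s.split() if w.lower() not in STOPWORDS]
        score = sum(freq[w] for w in content) / max(len(content), 1)
        scored.append((score, s))
    return sorted(scored, reverse=True)  # highest-scoring sentences first

if __name__ == "__main__":
    docs = [
        "The broadcast covers the election results.",
        "Election results were announced on the broadcast.",
        "The weather was mild.",
    ]
    for score, sent in score_sentences(docs):
        print(f"{score:.2f}  {sent}")
```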

Topics

similarity scores

Appears in 14 sentences as: similarity score (4) similarity scores (10)
In Summarizing multiple spoken documents: finding evidence from untranscribed audio
  1. Park-Glass similarity scores by themselves can attribute a high score to distorted paths, which, in our context, ultimately leads to too many false-alarm alignments, even after applying the distortion threshold.
    Page 2, “Introduction”
  2. MEAD uses a redundancy removal mechanism similar to MMR, but to decide the salience of a sentence to the whole topic, MEAD uses not only its similarity score but also sentence position, e.g., the first sentence of each new story is considered important. (An MMR-style selection sketch follows this list.)
    Page 3, “Related work”
  3. The upper panel of Figure 1 shows a matrix of frame-level similarity scores between these two utterances where lighter grey represents higher similarity.
    Page 4, “An acoustics-based approach”
  4. All similarity scores are then normalized to the range of [0, 1], which yields similarity matrices exemplified in the upper panel of Figure 1.
    Page 4, “An acoustics-based approach”
  5. Given an M-by-N matrix of frame-level similarity scores, the top-left corner is considered the origin, and the bottom-right corner represents an alignment of the last frames in each sequence.
    Page 4, “An acoustics-based approach”
  6. For each of the multiple starting points p0 = (x0, y0), where either x0 = 0 or y0 = 0 but not necessarily both, we apply DTW to find paths P = p0, p1, ..., pK that maximize the sum over i = 0, ..., K of sim(pi), where sim(pi) is the cosine similarity score of point pi = (xi, yi) in the matrix. (A runnable sketch of this path search follows this list.)
    Page 4, “An acoustics-based approach”
  7. In our implementation, since most of the time is spent on calculating the average similarity scores on candidate subpaths, all average scores are therefore precalculated incrementally and saved.
    Page 5, “An acoustics-based approach”
  8. We have obtained the average similarity score of
    Page 5, “An acoustics-based approach”
  9. Based on this, we calculate relative similarity scores, which are computed by dividing the original similarity of a given subpath by the average similarity of its surrounding background.
    Page 5, “An acoustics-based approach”
  10. Given two subpaths with nearly identical average similarity scores, we suggest that the longer of the two is more likely to refer to content of interest that is shared between two speech utterances, e.g., named entities.
    Page 5, “An acoustics-based approach”
  11. The similarity scores of subpaths can vary widely over different spoken documents.
    Page 5, “An acoustics-based approach”
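
The MEAD/MMR-style selection mentioned in sentence 2 can be made concrete with a small sketch. This is generic background on MMR-style redundancy removal with a sentence-position bonus, not the paper's or MEAD's actual code; the weight lam, the bonus value, and the input structures are illustrative assumptions.

```python
# A generic MMR-style selection loop with a sentence-position bonus,
# illustrating the salience/redundancy trade-off described for MEAD in
# sentence 2 above. Weights and the position bonus are illustrative
# assumptions, not values from the paper or from MEAD.
def mmr_select(candidates, sim_to_topic, sim_between, positions,
               k=3, lam=0.7, first_sentence_bonus=0.1):
    """candidates: list of sentence ids; sim_to_topic[s]: similarity of s
    to the topic centroid; sim_between[(s, t)]: similarity between two
    sentences; positions[s]: 0 if s is the first sentence of its story."""
    selected = []
    while candidates and len(selected) < k:
        def score(s):
            salience = sim_to_topic[s] + (first_sentence_bonus if positions[s] == 0 else 0.0)
            redundancy = max((sim_between[(s, t)] for t in selected), default=0.0)
            return lam * salience - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

In MEAD itself the salience score combines several features (including position, as quoted above); the single bonus term here is only a stand-in for that heuristic.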
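Sentences 3-11 describe the core acoustic matching procedure: build a frame-level cosine similarity matrix, normalize it to [0, 1], search for monotonic (DTW-style) paths that maximize the summed similarity, and score paths relative to their background. The sketch below is a minimal reconstruction under stated assumptions: random vectors stand in for MFCC frames, min-max scaling stands in for the unspecified normalization, the search runs from a single start point to the bottom-right corner, and the "surrounding background" is approximated by the whole matrix.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a: (M, d) frame vectors of utterance 1; b: (N, d) frame vectors of
    utterance 2. Returns an M-by-N matrix scaled to [0, 1] (min-max
    scaling is an assumption; the paper only states the target range)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    sim = a @ b.T                                   # raw cosine values
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-12)

def best_path(sim: np.ndarray, start=(0, 0)):
    """DTW-style search for a monotonic path from `start` (a point on the
    top or left edge) to the bottom-right corner that maximizes the
    summed similarity sum_i sim(p_i)."""
    M, N = sim.shape
    x0, y0 = start
    score = np.full((M, N), -np.inf)
    score[x0, y0] = sim[x0, y0]
    for i in range(x0, M):
        for j in range(y0, N):
            if (i, j) == (x0, y0):
                continue
            prev = max(
                score[i - 1, j] if i > x0 else -np.inf,
                score[i, j - 1] if j > y0 else -np.inf,
                score[i - 1, j - 1] if (i > x0 and j > y0) else -np.inf,
            )
            score[i, j] = sim[i, j] + prev
    # Backtrack from the bottom-right corner along best predecessors.
    i, j = M - 1, N - 1
    path = [(i, j)]
    while (i, j) != (x0, y0):
        cands = []
        if i > x0:
            cands.append((score[i - 1, j], (i - 1, j)))
        if j > y0:
            cands.append((score[i, j - 1], (i, j - 1)))
        if i > x0 and j > y0:
            cands.append((score[i - 1, j - 1], (i - 1, j - 1)))
        i, j = max(cands)[1]
        path.append((i, j))
    return path[::-1]

def relative_score(sim: np.ndarray, path) -> float:
    """Average similarity along the path divided by the average similarity
    of the background (here the whole matrix stands in for the paper's
    'surrounding background')."""
    on_path = float(np.mean([sim[i, j] for i, j in path]))
    return on_path / (float(sim.mean()) + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utt1 = rng.normal(size=(40, 13))   # stand-ins for MFCC frame vectors
    utt2 = rng.normal(size=(50, 13))
    sim = cosine_similarity_matrix(utt1, utt2)
    path = best_path(sim)
    print(len(path), round(relative_score(sim, path), 3))
```

The paper's actual procedure additionally restarts the search from every point on the top and left edges (sentence 6), applies a distortion threshold (sentence 1), and prefers longer subpaths via relative-similarity heuristics (sentences 9-10); those refinements are omitted here.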

error rate

Appears in 4 sentences as: error rate (3) error rates (2)
In Summarizing multiple spoken documents: finding evidence from untranscribed audio
  1. We compare the performance of this model with that achieved using manual and automatic transcripts, and find that this new approach is roughly equivalent to having access to ASR transcripts with word error rates in the 33-37% range without actually having to do the ASR, plus it better handles utterances with out-of-vocabulary words. (The standard word-error-rate computation is sketched after this list.)
    Page 1, “Abstract”
  2. These transcripts contain a word error rate of 12.6%, which is comparable to the best accuracies obtained in the literature on this data set.
    Page 6, “Experimental setup”
  3. Since ASR performance can vary greatly as we discussed above, we compare our system against automatic transcripts having word error rates of 12.6%, 20.9%, 29.2%, and 35.5% on the same speech source.
    Page 7, “Experimental results”
  4. [Figure residue: a multi-panel plot of summarization performance versus word error rate; panels labeled Len=20%, Rand=0.324; Len=20%, Rand=0.340; Len=30%, Rand=0.389; Len=30%, Rand=0.402; x-axis: Word error rate.]
    Page 7, “Experimental results”
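
For reference, the word error rates quoted above follow the standard definition: the word-level edit distance (substitutions + insertions + deletions) between the ASR hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation (general background, not the paper's evaluation code):

```python
# Standard word error rate (WER): Levenshtein distance between the
# hypothesis and reference word sequences, divided by the reference
# length. General background, not code from the paper.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deletion and one substitution over a 5-word reference -> 0.4
print(word_error_rate("transcripts with word error rates",
                      "transcripts word error rate"))
```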

named entities

Appears in 3 sentences as: named entities (3)
In Summarizing multiple spoken documents: finding evidence from untranscribed audio
  1. This out-of-vocabulary (OOV) problem is unavoidable in the regular ASR framework, although it is more likely to happen on salient words such as named entities or domain-specific terms.
    Page 2, “Introduction”
  2. Given two subpaths with nearly identical average similarity scores, we suggest that the longer of the two is more likely to refer to content of interest that is shared between two speech utterances, e.g., named entities.
    Page 5, “An acoustics-based approach”
  3. Although named entities and domain-specific terms are often highly relevant to the documents in which they are referenced, these types of words are often not included in ASR vocabularies, due to their relative global rarity.
    Page 8, “Experimental results”
