Grounded Language Modeling for Automatic Speech Recognition of Sports Video
Fleischman, Michael and Roy, Deb

Article Structure

Abstract

Grounded language models represent the relationship between words and the nonlinguistic context in which they are said.

Introduction

Recognizing speech in broadcast video is a necessary precursor to many multimodal applications such as video search and summarization (Snoek and Worring, 2005).

Representing Events in Sports Video

Recent work in video surveillance has demonstrated the benefit of representing complex events as temporal relations between lower-level sub-events (Hongeng et al., 2004).

Linguistic Mapping

Modeling the relationship between words and nonlinguistic context assumes that the speech uttered in a video refers consistently (although not exclusively) to the events being represented by the temporal pattern features.

Evaluation

In order to evaluate our grounded language modeling approach, a parallel data set of 99 Major League Baseball games with corresponding closed captioning transcripts was recorded from live television.

Conclusions

We have described a method for improving speech recognition in video.

Topics

language model

Appears in 46 sentences as: language model (31) language modeling (7) language models (27)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. Grounded language models represent the relationship between words and the nonlinguistic context in which they are said.
    Page 1, “Abstract”
  2. Results show that grounded language models improve perplexity and word error rate over text-based language models, and further, support video information retrieval better than human-generated speech transcriptions.
    Page 1, “Abstract”
  3. The method is based on the use of grounded language models to represent...
    Page 1, “Introduction”
  4. Grounded language models are based on research from cognitive science on grounded models of meaning.
    Page 1, “Introduction”
  5. This paper extends previous work on grounded models of meaning by learning a grounded language model from naturalistic data collected from broadcast video of Major League Baseball games.
    Page 1, “Introduction”
  6. This corpus is used to train the grounded language model, which, like traditional language models, encodes the prior probability of words for an ASR system.
    Page 1, “Introduction”
  7. Unlike traditional language models, however, grounded language models represent the probability of a word conditioned not only on the previous word(s), but also on features of the nonlinguistic context in which the word was uttered.
    Page 1, “Introduction”
  8. Our approach to learning grounded language models operates in two phases.
    Page 1, “Introduction”
  9. In the following sections we describe these two aspects of our approach and evaluate the performance of our grounded language model on a speech recognition task using video highlights from Major League Baseball games.
    Page 2, “Introduction”
  10. We model this relationship, much like traditional language models, using conditional probability distributions.
    Page 4, “Linguistic Mapping”
  11. Unlike traditional language models, however, our grounded language models condition the probability of a word not only on the word(s) uttered before it, but also on the temporal pattern features that describe the nonlinguistic context in which it was uttered (see the sketch below).
    Page 4, “Linguistic Mapping”

error rate

Appears in 13 sentences as: Error Rate (2) error rate (9) error rates (2)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. Results show that grounded language models improve perplexity and word error rate over text-based language models, and further, support video information retrieval better than human-generated speech transcriptions.
    Page 1, “Abstract”
  2. Results indicate improved performance using three metrics: perplexity, word error rate, and precision on an information retrieval task.
    Page 2, “Introduction”
  3. We evaluate our grounded language modeling approach using three metrics: perplexity, word error rate, and precision on an information retrieval task.
    Page 6, “Evaluation”
  4. 4.2 Word Accuracy and Error Rate
    Page 6, “Evaluation”
  5. Word error rate (WER) is a normalized measure of the number of word insertions, substitutions, and deletions required to transform the output transcription of an ASR system into a human-generated gold standard transcription of the same utterance (see the sketch after this list).
    Page 6, “Evaluation”
  6. Unlike perplexity, which only evaluates the performance of language models, examining word accuracy and error rate requires running an entire ASR system, i.e.
    Page 6, “Evaluation”
  7. Word Error Rate (WER)
    Page 7, “Evaluation”
  8. Word accuracy and error rates for ASR systems using a grounded language model, a text-based language model trained on the Switchboard corpus, and the Switchboard model interpolated with a text-based model trained on baseball closed captions.
    Page 7, “Evaluation”
  9. Even with this noise, however, results indicate that word accuracy and error rate when using the grounded language model are significantly better than with both the Switchboard model (absolute WER reduction of 13%; absolute accuracy increase of 15.2%) and the Switchboard model interpolated with the baseball-specific text-based language model (absolute WER reduction of 3.7%; absolute accuracy increase of 5.9%).
    Page 7, “Evaluation”
  10. Drawing conclusions about the usefulness of grounded language models using word accuracy or error rate alone is difficult.
    Page 7, “Evaluation”
  11. Thus, in the next section we examine an extrinsic evaluation in which grounded language models are judged not directly on their effect on word accuracy or error rate, but on their ability to support video information retrieval.
    Page 7, “Evaluation”

bigram

Appears in 5 sentences as: bigram (4) bigrams (1)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. Estimating bigram and trigram models can be done by processing word pairs or triples and normalizing the resulting conditional distributions (see the sketch after this list).
    Page 5, “Linguistic Mapping”
  2. The remaining 93 unlabeled games are used to train unigram, bigram, and trigram grounded language models.
    Page 5, “Evaluation”
  3. Only unigrams, bigrams, and trigrams that are not proper names, appear more than three times, and are not composed only of stop words were used.
    Page 5, “Evaluation”
  4. with traditional unigram, bigram, and trigram language models generated from a combination of the closed captioning transcripts of all training games and data from the Switchboard corpus (see below).
    Page 6, “Evaluation”
  5. are then used to find the phrases (unigram, trigram, and bigram) most indicative of each label (e.g. “fly ball” for category fly ball).
    Page 8, “Evaluation”

unigram

Appears in 5 sentences as: unigram (4) unigrams (1)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. In the discussion that follows, we describe a method for estimating unigram grounded language models.
    Page 5, “Linguistic Mapping”
  2. The remaining 93 unlabeled games are used to train unigram, bigram, and trigram grounded language models.
    Page 5, “Evaluation”
  3. Only unigrams, bigrams, and trigrams that are not proper names, appear more than three times, and are not composed only of stop words were used (see the sketch after this list).
    Page 5, “Evaluation”
  4. with traditional unigram, bigram, and trigram language models generated from a combination of the closed captioning transcripts of all training games and data from the Switchboard corpus (see below).
    Page 6, “Evaluation”
  5. are then used to find the phrases (unigram, trigram, and bigram) most indicative of each label (e.g. “fly ball” for category fly ball).
    Page 8, “Evaluation”

gold standard

Appears in 4 sentences as: gold standard (3) gold standards (1)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. From this test set, baseball highlights (i.e., events which terminate with the player either out or safe) were hand annotated for use in evaluation, and manually transcribed in order to get clean text transcriptions for gold standard comparisons.
    Page 5, “Evaluation”
  2. Word error rate (WER) is a normalized measure of the number of word insertions, substitutions, and deletions required to transform the output transcription of an ASR system into a human-generated gold standard transcription of the same utterance.
    Page 6, “Evaluation”
  3. Word accuracy is simply the number of words in the gold standard that the system correctly recognized.
    Page 6, “Evaluation”
  4. Although this is somewhat counterintuitive given that hand transcriptions are typically considered gold standards, these results follow from a limitation of using text-based methods to index video.
    Page 8, “Evaluation”

LDA

Appears in 3 sentences as: LDA (3)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. In this work we follow closely the Author-Topic (AT) model (Steyvers et al., 2004), which is a generalization of Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
    Page 5, “Linguistic Mapping”
  2. LDA is a technique that was developed to model the distribution of topics discussed in a large corpus of documents.
    Page 5, “Linguistic Mapping”
  3. The AT model generalizes LDA, saying that the mixture of topics is not dependent on the document itself, but rather on the authors who wrote it (see the equation below).
    Page 5, “Linguistic Mapping”

probability distribution

Appears in 3 sentences as: probability distribution (2) probability distributions (1)
In Grounded Language Modeling for Automatic Speech Recognition of Sports Video
  1. In the second phase, a conditional probability distribution is estimated that describes the probability that a word was uttered given such event representations.
    Page 2, “Introduction”
  2. We model this relationship, much like traditional language models, using conditional probability distributions.
    Page 4, “Linguistic Mapping”
  3. The model assumes that every document is made up of a mixture of topics, and that each word in a document is generated from a probability distribution associated with one of those topics (see the equation below).
    Page 5, “Linguistic Mapping”
