Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
Hao Tang, Joseph Keshet, and Karen Livescu

Article Structure

Abstract

We address the problem of learning the mapping between words and their possible pronunciations in terms of sub-word units.

Introduction

One of the problems faced by automatic speech recognition, especially of conversational speech, is that of modeling the mapping between words and their possible pronunciations in terms of sub-word units such as phones.

Problem setting

We define a pronunciation of a word as a representation of the way it is produced by a speaker in terms of some set of linguistically meaningful sub-word units.

Algorithm

Similarly to previous work in structured prediction (Taskar et al., 2003; Tsochantaridis et al., 2005), we construct the function $f$ from a predefined set of $N$ feature functions, $\{\phi_j\}_{j=1}^{N}$, each of the form $\phi_j : \mathcal{P}^* \times \mathcal{V} \to \mathbb{R}$. Each feature function takes a surface pronunciation $\bar{p}$ and a proposed word $w$ and returns a scalar which, intuitively, should be correlated with whether the pronunciation $\bar{p}$ corresponds to the word $w$.
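
As a rough sketch of this setup (names and signatures are ours, purely illustrative, not the paper's), the model scores each candidate word with a weighted sum of the feature functions and predicts the highest-scoring word:

```python
def predict(surface, vocab, feature_fns, theta):
    """Score each candidate word w by sum_j theta_j * phi_j(surface, w)
    and return the argmax. A minimal sketch of a linear model over
    feature functions; the paper's feature set is far richer."""
    def score(word):
        return sum(t * phi(surface, word) for t, phi in zip(theta, feature_fns))
    return max(vocab, key=score)
```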

Feature functions

Before defining the feature functions, we define some notation.

Experiments

All experiments are conducted on a subset of the Switchboard conversational speech corpus that has been phonetically transcribed.

Discussion

The results in Section 5 are the best obtained thus far on the lexical access task on this conversational data set.

Topics

n-gram

Appears in 8 sentences as: n-gram (11)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. We use $\bar{p}_{1:n}$ to denote the n-gram substring $p_1 \cdots p_n$.
    Page 4, “Feature functions”
  2. Two substrings $\bar{a}$ and $\bar{b}$ are said to be equal if they have the same length and $a_i = b_i$ for $1 \le i \le n$. For a given sub-word unit n-gram $\bar{u} \in \mathcal{P}^n$, we use the shorthand $\bar{u} \in \bar{p}$ to mean that we can find $\bar{u}$ in $\bar{p}$; i.e., there exists an index $i$ such that $\bar{p}_{i:i+n-1} = \bar{u}$. We use $|\bar{p}|$ to denote the length of the sequence $\bar{p}$.
    Page 4, “Feature functions”
  3. Similarly to (Zweig et al., 2010), we adapt TF and IDF by treating a sequence of sub-word units as a “document” and n-gram sub-sequences as “words.” In this analogy, we use sub-sequences in surface pronunciations to “search” for baseforms in the dictionary.
    Page 4, “Feature functions”
  4. TF-IDF features measure the frequency of each n-gram in observed pronunciations of a given word in the training set, along with the discriminative power of the n-gram.
    Page 4, “Feature functions”
  5. The term frequency (TF) of a sub-word unit n-gram $\bar{u} \in \mathcal{P}^n$ in a sequence $\bar{p}$ is the number of occurrences of $\bar{u}$ in $\bar{p}$, normalized by the number of n-grams in $\bar{p}$.
    Page 4, “Feature functions”
  6. Next, define the set of words in the training set that contain the n-gram $\bar{u}$ as $\mathcal{V}_{\bar{u}} = \{w \in \mathcal{V} \mid (\bar{p}, w) \in S,\ \bar{u} \in \bar{p}\}$.
    Page 4, “Feature functions”
  7. The inverse document frequency (IDF) of an n-gram $\bar{u}$ is defined as $\mathrm{IDF}_{\bar{u}} = \log \frac{|\mathcal{V}|}{|\mathcal{V}_{\bar{u}}|}$.
    Page 4, “Feature functions”
  8. IDF represents the discriminative power of an n-gram: An n-gram that occurs in few words is better at word discrimination than a very common n-gram (see the sketch after this list).
    Page 4, “Feature functions”
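
To make the adapted TF-IDF concrete, here is a minimal Python sketch of the two quantities defined above. It assumes pronunciations are lists of phone symbols and the training set is a list of (surface pronunciation, word) pairs; all names are ours, not the paper's.

```python
from collections import defaultdict
from math import log

def ngrams(phones, n=2):
    """All order-n substrings of a phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def term_frequency(u, phones, n=2):
    """TF_u(p): occurrences of the n-gram u in p, normalized by the
    number of n-grams in p."""
    grams = ngrams(phones, n)
    return grams.count(u) / len(grams) if grams else 0.0

def inverse_document_frequency(training_set, vocab, n=2):
    """IDF_u = log(|V| / |V_u|), where V_u is the set of words whose
    observed pronunciations in the training set contain u."""
    words_with = defaultdict(set)
    for phones, word in training_set:
        for u in set(ngrams(phones, n)):
            words_with[u].add(word)
    return {u: log(len(vocab) / len(ws)) for u, ws in words_with.items()}

# Toy check in the spirit of the paper's /l iy/ example (data hypothetical):
surface = ["l", "iy", "v", "ax", "n", "z"]        # 5 bigrams, /l iy/ once
print(term_frequency(("l", "iy"), surface))       # 1/5 = 0.2
idf = inverse_document_frequency([(surface, "w1")], {"w1", "w2"})
print(idf[("l", "iy")])                           # log(2/1)
```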

CRF

Appears in 7 sentences as: CRF (8) “CRF” (1)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. (2) under the log-loss results in a probabilistic model commonly known as a conditional random field (CRF) (Lafferty et al., 2001).
    Page 4, “Algorithm”
  2. We use the term “CRF” since the learning algorithm corresponds to CRF learning, although the task is multiclass classification rather than a sequence or structure prediction task.
    Page 7, “Experiments”
  3. CRF learning with the same features performs about 6% worse than the corresponding PA and Pegasos models.
    Page 7, “Experiments”
  4. The single-threaded running time for PA/DP+ and Pegasos/DP+ is about 40 minutes per epoch, measured on a dual-core AMD 2.4GHz CPU with 8GB of memory; for CRF, it takes about 100 minutes per epoch, which is almost entirely because the weight vector $\theta$ is less sparse with CRF learning.
    Page 7, “Experiments”
  5. In the PA and Pegasos algorithms, we only update $\theta$ for the most confusable word, while in CRF learning, we sum over all words (the sketch after this list contrasts the two updates).
    Page 7, “Experiments”
  6. In our case, the number of nonzero entries in $\theta$ for PA and Pegasos is around 800,000; for CRF, it is over 4,000,000.
    Page 7, “Experiments”
  7. Large-margin learning, using the Passive-Aggressive and Pegasos algorithms, has benefits over CRF learning for our task: It produces sparser models, is faster, and produces better lexical access results.
    Page 8, “Discussion”
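
The sparsity contrast described in items 5 and 6 can be made concrete with a simplified multiclass sketch (shapes and names are ours): the CRF gradient mixes in the feature vectors of every word, weighted by model probability, while the PA/Pegasos-style update touches only the gold word and its strongest rival.

```python
import numpy as np

def crf_gradient(theta, phi_matrix, gold):
    """Log-loss gradient for one example: expected features under the model
    distribution over ALL words minus the gold word's features. Every row
    of phi_matrix contributes, which tends to make theta dense."""
    scores = phi_matrix @ theta                 # one score per word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ phi_matrix - phi_matrix[gold]

def pa_direction(theta, phi_matrix, gold):
    """PA/Pegasos-style direction: only the most confusable wrong word
    contributes, so each update touches few features and theta stays sparse."""
    scores = phi_matrix @ theta
    scores[gold] = -np.inf                      # exclude the gold word
    rival = int(np.argmax(scores))
    return phi_matrix[gold] - phi_matrix[rival]
```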

learning algorithm

Appears in 7 sentences as: learning algorithm (4) learning algorithms (3)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. In the next section we present a learning algorithm that aims to minimize the expected zero-one loss.
    Page 3, “Problem setting”
  2. We compare the CRF, Passive-Aggressive (PA), and Pegasos learning algorithms.
    Page 7, “Experiments”
  3. We use the term “CRF” since the learning algorithm corresponds to CRF learning, although the task is multiclass classification rather than a sequence or structure prediction task.
    Page 7, “Experiments”
  4. Models labeled X/Y use learning algorithm X and feature set Y.
    Page 7, “Experiments”
  5. The remaining rows of Table 2 give results with our feature functions and various learning algorithms.
    Page 7, “Experiments”
  6. The results on the test fold are shown in Figure 1, which compares the learning algorithms, and Figure 2, which compares feature sets.
    Page 8, “Experiments”
  7. (1) extension of the model and learning algorithm to word sequences and (2) feature functions that relate acoustic measurements to sub-word units.
    Page 8, “Discussion”

error rate

Appears in 6 sentences as: error rate (4) error rates (2)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. In experiments on a subset of the Switchboard conversational speech corpus, our models thus far improve classification error rates from a previously published result of 29.1% to about 15%.
    Page 1, “Abstract”
  2. For generative models, phonetic error rate of generated pronunciations (Venkataramani and Byrne, 2001) and …
    Page 2, “Introduction”
  3. We measure performance by error rate (ER), the proportion of test examples predicted incorrectly.
    Page 7, “Experiments”
  4. We can see that, by adding just the Levenshtein distance, the error rate drops significantly.
    Page 7, “Experiments”
  5. Table 2: Lexical access error rates (ER) on the same data split as in (Livescu and Glass, 2004; Jyothi et al., 2011).
    Page 7, “Experiments”
  6. Each point corresponds to the test set error rate for one of the 5 data splits.
    Page 8, “Experiments”

development set

Appears in 5 sentences as: development set (5)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. The regularization parameter $\lambda$ is tuned on the development set.
    Page 7, “Experiments”
  2. We run all three algorithms for multiple epochs and pick the best epoch based on development set performance.
    Page 7, “Experiments”
  3. For the first set of experiments, we use the same division of the corpus as in (Livescu and Glass, 2004; Jyothi et al., 2011) into a 2492-word training set, a 165-word development set, and a 236-word test set.
    Page 7, “Experiments”
  4. The best result for PA/DP+ (the PA algorithm using all features besides the DBN features) on the development set is with $\lambda = 100$ and 5 epochs.
    Page 7, “Experiments”
  5. The best result for Pegasos with the same features on the development set is with $\lambda = 0.01$ and 10 epochs.
    Page 7, “Experiments”

weight vector

Appears in 5 sentences as: weight vector (5) weight vectors (1)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. Finding the weight vector $\theta$ that minimizes the $\ell_2$-regularized average of this loss function is the structured support vector machine (SVM) problem (Taskar et al., 2003; Tsochantaridis et al., 2005):
    Page 3, “Algorithm”
  2. Denote by $\theta_{t-1}$ the value of the weight vector before the $t$-th round.
    Page 3, “Algorithm”
  3. Let $\Delta\phi_i = \phi(\bar{p}_i, w_i) - \phi(\bar{p}_i, \hat{w})$. Then the algorithm updates the weight vector $\theta_t$ as follows (a generic form of this update is sketched after this list):
    Page 3, “Algorithm”
  4. The final weight vector is set to the average over all weight vectors during training.
    Page 4, “Algorithm”
  5. The single-threaded running time for PA/DP+ and Pegasos/DP+ is about 40 minutes per epoch, measured on a dual-core AMD 2.4GHz CPU with 8GB of memory; for CRF, it takes about 100 minutes per epoch, which is almost entirely because the weight vector $\theta$ is less sparse with CRF learning.
    Page 7, “Experiments”
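
The quoted sentence in item 3 stops right before the update rule itself. For reference, a generic Passive-Aggressive (PA-I) update over the quantities named there might look like the sketch below; this is the textbook form (Crammer et al., 2006), and the paper's exact variant (margin term, aggressiveness parameter) may differ.

```python
import numpy as np

def pa_update(theta, delta_phi, margin=1.0, aggressiveness=1.0):
    """One PA-I round: if the margin constraint theta . delta_phi >= margin
    is violated, step along delta_phi just far enough to satisfy it,
    capped by the aggressiveness parameter."""
    loss = max(0.0, margin - theta @ delta_phi)
    norm_sq = delta_phi @ delta_phi
    if loss == 0.0 or norm_sq == 0.0:
        return theta                  # constraint already satisfied
    tau = min(aggressiveness, loss / norm_sq)
    return theta + tau * delta_phi
```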

best result

Appears in 4 sentences as: best result (3) best results (1)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. The best results are in bold; the differences among them are insignificant (according to McNemar’s test with p = .05).
    Page 7, “Experiments”
  2. The best result for PA/DP+ (the PA algorithm using all features besides the DBN features) on the development set is with $\lambda = 100$ and 5 epochs.
    Page 7, “Experiments”
  3. The best result for Pegasos with the same features on the development set is with $\lambda = 0.01$ and 10 epochs.
    Page 7, “Experiments”
  4. Combining the two gives close to our best result .
    Page 8, “Experiments”

bigrams

Appears in 4 sentences as: bigram (1) bigrams (3)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. In practice, we only consider n-grams of a certain order (e.g., bigrams).
    Page 4, “Feature functions”
  2. Then for the bigram /l iy/, we have $\mathrm{TF}_{/l\,iy/}(\bar{p}) = 1/5$ (one out of five bigrams in $\bar{p}$), and $\mathrm{IDF}_{/l\,iy/} = \log(2/1)$ (one word out of two in the dictionary).
    Page 4, “Feature functions”
  3. The TF-IDF features used in the experiments are based on phone bigrams .
    Page 7, “Experiments”
  4. In the figure, phone bigram TF-IDF is labeled p2; phonetic alignment with dynamic programming is labeled DP.
    Page 8, “Discussion”

generative models

Appears in 4 sentences as: generative model (1) generative modeling (1) generative models (2)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. Most previous approaches have involved generative modeling of the distribution of pronunciations, usually trained to maximize likelihood.
    Page 1, “Abstract”
  2. In other words, these approaches optimize generative models using discriminative criteria.
    Page 2, “Introduction”
  3. We propose a general, flexible discriminative approach to pronunciation modeling, rather than discriminatively optimizing a generative model.
    Page 2, “Introduction”
  4. For generative models, phonetic error rate of generated pronunciations (Venkataramani and Byrne, 2001) and …
    Page 2, “Introduction”

dynamic programming

Appears in 3 sentences as: dynamic programming (3)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. Given $(\bar{p}, w)$, we use dynamic programming to align the surface form $\bar{p}$ with all of the baseforms of $w$ (a simplified sketch follows this list). Following (Riley et al., 1999), we encode a phoneme/phone with a 4-tuple: consonant manner, consonant place, vowel manner, and vowel place.
    Page 5, “Feature functions”
  2. The feature selection experiments in Figure 2 show that the TF-IDF features alone are quite weak, while the dynamic programming alignment features alone are quite good.
    Page 8, “Experiments”
  3. In the figure, phone bigram TF-IDF is labeled p2; phonetic alignment with dynamic programming is labeled DP.
    Page 8, “Discussion”
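
As a simplified illustration of the alignment in item 1, here is a standard edit-distance DP over phone strings (names are ours). Note that the paper scores substitutions by phonetic similarity via the 4-tuple encoding of (Riley et al., 1999); this sketch uses flat unit costs instead.

```python
def align_cost(surface, baseform, sub=1.0, ins=1.0, dele=1.0):
    """Levenshtein-style DP cost of aligning a surface pronunciation with
    a dictionary baseform (both lists of phone symbols)."""
    m, n = len(surface), len(baseform)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele                       # delete all surface phones
    for j in range(1, n + 1):
        d[0][j] = j * ins                        # insert all baseform phones
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0.0 if surface[i - 1] == baseform[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + match,   # substitute / match
                          d[i - 1][j] + dele,        # delete surface phone
                          d[i][j - 1] + ins)         # insert baseform phone
    return d[m][n]

# Aligning a surface form against all baseforms of w would take the minimum
# of align_cost(surface, b) over the dictionary baseforms b of w.
```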

feature set

Appears in 3 sentences as: feature set (2) feature sets (1)
In Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
  1. Models labeled X/Y use learning algorithm X and feature set Y.
    Page 7, “Experiments”
  2. The feature set DP+ contains TF-IDF, DP alignment, dictionary, and length features.
    Page 7, “Experiments”
  3. The results on the test fold are shown in Figure 1, which compares the learning algorithms, and Figure 2, which compares feature sets.
    Page 8, “Experiments”
