Abstract | Results show that grounded language models improve perplexity and word error rate over text-based language models, and further, support video information retrieval better than human-generated speech transcriptions. |
Evaluation | We evaluate our grounded language modeling approach using three metrics: perplexity, word error rate, and precision on an information retrieval task.
Evaluation | 4.2 Word Accuracy and Error Rate |
Evaluation | Word error rate (WER) is a normalized measure of the number of word insertions, substitutions, and deletions required to transform the output transcription of an ASR system to a human-generated gold-standard transcription of the same utterance.
Introduction | Results indicate improved performance using three metrics: perplexity, word error rate, and precision on an information retrieval task.
Abstract | In comparison with a state-of-the-art syllabification system, we reduce the syllabification word error rate for English by 33%. |
Introduction | With this approach, we reduce the error rate for English by 33%, relative to the best existing system. |
L2P Performance | In English, perfect syllabification produces a relative error reduction of 10.6%, and our model captures over half of the possible improvement, reducing the error rate by 6.0%. |
L2P Performance | Although perfect syllabification reduces their L2P relative error rate by 18%, they find that their learned model actually increases the error rate.
L2P Performance | For Dutch, perfect syllabification reduces the relative L2P error rate by 17.5%; we realize over 70% of the available improvement with our syllabification model, reducing the relative error rate by 12.4%. |
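The relative error rate reductions quoted in these comparisons (33%, 10.6%, 17.5%, 12.4%, and so on) all follow the same arithmetic; a minimal sketch, with an illustrative function name not taken from any of the cited papers:

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative error rate reduction, in percent:
    100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_err - new_err) / baseline_err

# Cutting a 30% error rate down to 20% is a 33.3% relative reduction,
# even though the absolute reduction is only 10 points.
print(round(relative_error_reduction(30.0, 20.0), 1))
```

This is why a system can report a large relative reduction while the absolute error rates of the two systems remain close.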
Syllabification Experiments | Syllable break error rate (SB ER) captures the incorrect tags that cause an error in syllabification. |
Syllabification Experiments | Table 1 presents the word accuracy and syllable break error rate achieved by each of our tag sets on both the CELEX and NETtalk datasets. |
Syllabification Experiments | Overall, our best tag set lowers the error rate by one-third, relative to SbA’s performance. |
Abstract | This paper analyzes a variety of lexical, prosodic, and disfluency factors to determine which are likely to increase ASR error rates.
Abstract | (3) Although our results are based on output from a system with speaker adaptation, speaker differences are a major factor influencing error rates, and the effects of features such as frequency, pitch, and intensity may vary between speakers.
Data | The standard measure of error used in ASR is word error rate (WER), computed as 100(I + D + S) / R, where I, D, and S are the number of insertions, deletions, and substitutions found by aligning the ASR hypotheses with the reference transcriptions, and R is the number of reference words.
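The WER formula above can be made concrete with a small Levenshtein alignment over word tokens; a minimal sketch, not the implementation used by any of the cited systems:

```python
def wer(reference, hypothesis):
    """Word error rate: 100 * (I + D + S) / R, where R = len(reference),
    computed via Levenshtein alignment over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edit operations turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of a six-word reference: 100 * 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the count of edits is normalized by the reference length only.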
Data | Since we wish to know what features of a reference word increase the probability of an error, we need a way to measure the errors attributable to individual words — an individual word error rate (IWER). |
Introduction | Previous work on recognition of spontaneous monologues and dialogues has shown that infrequent words are more likely to be misrecognized (Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001) and that fast speech increases error rates (Siegler and Stern, 1995; Fosler-Lussier and Morgan, 1999; Shinozaki and Furui, 2001).
Introduction | Siegler and Stern (1995) and Shinozaki and Furui (2001) also found higher error rates in very slow speech. |
Introduction | Word length (in phones) has also been found to be a useful predictor of higher error rates (Shinozaki and Furui, 2001). |
Abstract | We report a significant reduction in word error rate compared to a state-of-the-art baseline system. |
Experiments | For a given test set we could then compare the word error rate of the baseline system with that of the extended system employing the grammar-based language model. |
Experiments | exceptionally high baseline word error rate.
Experiments | These classes are interviews (a word error rate of 36.1%), sports reports (28.4%) and press conferences (25.7%). |
Language Model 2.1 The General Approach | The influence of N on the word error rate is discussed in the results section. |
Abstract | We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. |
Abstract | Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
Experimental design | In order to measure accuracy, we compute the confusion matrix for each method, and from this we compute error rates.
Experimental design | We report both word-level and letter-level error rates.
Experimental design | The word-level error rate is the fraction of words on which a method makes at least one mistake. |
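The word-level error rate defined here, together with its letter-level counterpart, can be sketched as follows; the data layout (one label string per word, one label per letter position, as in hyphenation decisions) is a hypothetical illustration, not the paper's representation:

```python
def word_level_error_rate(gold, pred):
    """Fraction of words on which the method makes at least one mistake.
    gold, pred: parallel lists of per-word label strings."""
    wrong = sum(1 for g, p in zip(gold, pred) if g != p)
    return wrong / len(gold)

def letter_level_error_rate(gold, pred):
    """Fraction of individual letter positions labelled incorrectly."""
    errors = total = 0
    for g, p in zip(gold, pred):
        errors += sum(1 for a, b in zip(g, p) if a != b)
        total += len(g)
    return errors / total

# Two words; the second has one wrong letter decision out of three.
gold = ["0100", "001"]
pred = ["0100", "011"]
print(word_level_error_rate(gold, pred))    # half the words have an error
print(letter_level_error_rate(gold, pred))  # 1 of 7 letter positions wrong
```

The two rates diverge exactly when errors cluster: a method whose mistakes pile up on a few hard words looks worse at the letter level than one that spreads single mistakes across many words.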
Experimental results | Figure 1 shows how the error rate is affected by increasing the CRF probability threshold for each language. |
Experimental results | Figure 1 shows confidence intervals for the error rates . |
Experimental results | All differences between rows in Table 2 are significant, with one exception: the serious error rates for PATGEN and TALO are not statistically significantly different.
History of automated hyphenation | The lowest per-letter test error rate reported is about 2%. |
Getting Humans in the Loop | Figure 6: The relative error rate (using round 0 as a baseline) of the best Mechanical Turk user session for each of the four numbers of topics. |
Simulation Experiment | The lower the classification error rate, the better the model has captured the structure of the corpus.
Simulation Experiment | While Null sees no constraints, it serves as an upper baseline for the error rate (lower error being better) but shows the effect of additional inference. |
Simulation Experiment | All Full is a lower baseline for the error rate since it both sees the constraints at the beginning and also runs for the maximum number of total iterations. |
Conclusions | Experiments have shown that this randomized language model can be combined with entropy pruning to achieve further memory reductions; that error rates occurring in practice are much lower than those predicted by theoretical analysis due to the use of runtime sanity checks; and that the same translation quality as a lossless language model representation can be achieved when using 12 ‘error’ bits, resulting in approx. |
Experiments | Section (3) analyzed the theoretical error rate; here, we measure error rates in practice when retrieving n-grams for approx. |
Experiments | The error rates for bigrams are close to their expected values. |
Perfect Hash-based Language Models | There is a tradeoff between space and error rate since the larger B is, the lower the probability of a false positive. |
Perfect Hash-based Language Models | For example, if |V| is 128 then taking B = 1024 gives an error rate of e = 128/1024 = 0.125, with each entry in A using ⌈log2 1024⌉ = 10 bits.
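The worked example above (|V| = 128, B = 1024) can be checked in a couple of lines; `fingerprint_params` is an illustrative name, not from the paper:

```python
import math

def fingerprint_params(value_range, codomain):
    """Error rate and per-entry storage for a lossy fingerprint scheme:
    a false positive happens when a random codomain value collides,
    so e = |V| / B, and each entry in A needs ceil(log2 B) bits."""
    error_rate = value_range / codomain
    bits_per_entry = math.ceil(math.log2(codomain))
    return error_rate, bits_per_entry

e, bits = fingerprint_params(value_range=128, codomain=1024)
print(e, bits)  # 0.125 10, matching the worked example
```

This makes the space/error tradeoff in the surrounding text concrete: doubling B halves the false-positive rate but costs one extra bit per stored entry.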
Perfect Hash-based Language Models | Querying each of these arrays for each n-gram requested would be inefficient and inflate the error rate since a false positive could occur on each individual array. |
Scaling Language Models | The space required in such a lossy encoding depends only on the range of values associated with the n-grams and the desired error rate, i.e.
Conclusion | The evaluation results showed that the proposed method significantly improved word alignment, achieving an absolute error rate reduction of 29% on multi-word alignment. |
Experiments on Phrase-Based SMT | And Koehn's implementation of minimum error rate training (Och, 2003) is used to tune the feature weights on the development set. |
Experiments on Word Alignment | For multi-word alignments, our methods significantly outperform the baseline method in terms of both precision and recall, achieving up to 18% absolute error rate reduction. |
Experiments on Word Alignment | With CM-3, the error rate of the multi-word alignment results is further reduced.
Experiments on Word Alignment | We can see that WA-1 achieves a lower alignment error rate as compared to the baseline method, since the performance of the improved one-directional alignment method is better than that of GIZA++.
Introduction | The evaluation results show that the proposed method in this paper significantly improves multi-word alignment, achieving an absolute error rate reduction of 29%.
Conclusion | The CRF achieves a lower error rate on the tagging task, but the RNN-trained model is better for the translation task.
Experiments | Several experiments were run to find suitable values for the hyperparameters p1 and p2; we choose the model with the lowest error rate on the validation corpus for the translation experiments.
Experiments | The error rate of the chosen model on the test corpus (the test corpus in Table 2) is 25.75% for the token error rate and 69.39% for the sequence error rate.
Experiments | The RNN has a token error rate of 27.31% and a sentence error rate of 77.00% over the test corpus in Table 2. |
Translation System Overview | The model scaling factors λ_1^M are trained with Minimum Error Rate Training (MERT).
Experimental evaluation | It is hard to quote the verbatim word error rate of the recognizer, because this would require a careful and time-consuming manual transcription of the test set. |
Experimental evaluation | Using the alignment we compute precision and recall for section headings and punctuation marks as well as the overall token error rate.
Experimental evaluation | It should be noted that the error rate derived in this way is not comparable to the word error rates usually reported in speech recognition research.
Probabilistic model | The decision rule (1) minimizes the document error rate . |
Transformation based learning | This method iteratively improves the match (as measured by token error rate) of a collection of corresponding source and target token sequences by positing and applying a sequence of substitution rules.
Abstract | In experiments on a subset of the Switchboard conversational speech corpus, our models thus far improve classification error rates from a previously published result of 29.1% to about 15%. |
Experiments | We measure performance by error rate (ER), the proportion of test examples predicted incorrectly. |
Experiments | We can see that, by adding just the Levenshtein distance, the error rate drops significantly.
Experiments | Table 2: Lexical access error rates (ER) on the same data split as in (Livescu and Glass, 2004; Jyothi et al., 2011). |
Introduction | For generative models, phonetic error rate of generated pronunciations (Venkataramani and Byrne, 2001) and
Abstract | Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system. |
Experiments | We evaluate the output of our system and the baseline systems using two metrics: character error rate (CER) and word error rate (WER). |
Experiments | Table 1: We evaluate the predicted transcriptions in terms of both character error rate (CER) and word error rate (WER), and report macro-averages across documents. |
Introduction | For example, even state-of-the-art OCR systems produce word error rates of over 50% on the documents shown in Figure 1. Unsurprisingly, such error rates are too high for many research projects (Arlitsch and Herbert, 2004; Shoemaker, 2005; Holley, 2010).
Learning | On document (a), which exhibits noisy typesetting, our system achieves a word error rate (WER) of 25.2. |
Experiments | One possible reason is that we only used n-best derivations instead of all possible derivations for minimum error rate training. |
Extended Minimum Error Rate Training | Minimum error rate training (Och, 2003) is widely used to optimize feature weights for a linear model (Och and Ney, 2002). |
Extended Minimum Error Rate Training | The key idea of MERT is to tune one feature weight at a time to minimize the error rate while keeping the others fixed.
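The one-weight-at-a-time idea is a form of coordinate descent. The sketch below substitutes a crude grid search for Och's exact line search along each coordinate (which exploits the piecewise-linear shape of the error surface), so it only illustrates the outer loop, not the real algorithm; all names are illustrative:

```python
def coordinate_mert(weights, error_fn, grid, passes=3):
    """Simplified sketch of MERT's outer loop: repeatedly tune one
    feature weight to minimize error while keeping the others fixed.

    error_fn(weights) scores a full weight vector; grid is the set of
    candidate values tried per coordinate (a stand-in for the exact
    line search of Och, 2003)."""
    weights = list(weights)
    for _ in range(passes):
        for k in range(len(weights)):
            best_v, best_err = weights[k], error_fn(weights)
            for v in grid:
                trial = weights[:k] + [v] + weights[k + 1:]
                err = error_fn(trial)
                if err < best_err:   # keep the best value for this axis
                    best_v, best_err = v, err
            weights[k] = best_v
    return weights

# Toy error surface with a unique minimum at w = (1, -2).
toy_err = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
print(coordinate_mert([0.0, 0.0], toy_err, [-3, -2, -1, 0, 1, 2, 3]))
```

Because each coordinate is optimized greedily, the procedure can stall in a local minimum of a non-convex error surface, which is why real MERT is typically restarted from multiple initial weight vectors.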
Extended Minimum Error Rate Training | Unfortunately, minimum error rate training cannot be directly used to optimize feature weights of max-translation decoding because Eq. |
Introduction | 0 As multiple derivations are used for finding optimal translations, we extend the minimum error rate training (MERT) algorithm (Och, 2003) to tune feature weights with respect to BLEU score for max-translation decoding (Section 4). |
Experimental Setup | A word error rate (WER) of 31% on the SM dataset was observed. |
Experimental Setup | However, due to substantial amount of speech recognition errors in our data, the POS error rate (resulting from the combined errors of ASR and automated POS tagger) is expected to be higher. |
Related Work | Automatic recognition of nonnative speakers’ spontaneous speech is a challenging task as evidenced by the error rate of the state-of-the-art systems.
Related Work | For instance, Chen and Zechner (2011) reported a 50.5% word error rate (WER) and Yoon and Bhat (2012) reported a 30% WER in the recognition of ESL students’ spoken responses. |
Related Work | These high error rates at the recognition stage negatively affect the subsequent stages of the speech scoring system in general, and in particular, during a deep syntactic analysis, which operates on a long sequence of words as its context. |
Abstract | We achieve 10.1% and 11.4% reduction in recognition word error rate (WER) relative to a standard baseline system on typewritten and handwritten Arabic respectively. |
Discriminative Reranking for OCR | The loss is computed as the Word Error Rate (WER) of the |
Discriminative Reranking for OCR | During training, the weights are updated according to the Margin-Infused Relaxed Algorithm (MIRA), whenever the highest scoring hypothesis differs from the hypothesis with the lowest error rate.
Experiments | We note that a small number of hypotheses per list is sufficient for RankSVM to obtain good performance, but increasing it further seems to increase the error rate.
Abstract | We present experimental results for heavily pruned backoff n-gram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. |
Experimental results | We now look at the impacts on system performance we can achieve with these new models, and whether the perplexity differences that we observe translate to real error rate reductions.
Experimental results | Word error rate (WER) |
Experimental results | The perplexity reductions that were achieved for these models do translate to real word error rate reductions at both stages of between 0.5 and 0.9 percent absolute. |
Abstract | A linear model is defined over derivations, and minimum error rate training is used to tune feature weights based on a set of question-answer pairs. |
Introduction | Derivations generated during such a translation procedure are modeled by a linear model, and minimum error rate training (MERT) (Och, 2003) is used to tune feature weights based on a set of question-answer pairs. |
Introduction | Given a set of question-answer pairs {Q_i, A_i} as the development (dev) set, we use the minimum error rate training (MERT) (Och, 2003) algorithm to tune the feature weights λ in our proposed model.
Introduction | A lower disambiguation error rate results in better relation expressions. |
Conclusion | However, our model’s performance today correlates strongly with an orthogonal accuracy metric, word error rate , on unseen data. |
Conclusion | This suggests that “retry rate” is a reasonable offline quality metric, to be considered in context among other metrics and traditional evaluation based on word error rate . |
Introduction | In particular, we seek to measure and minimize the word error rate (WER) of a system, with a WER of zero indicating perfect transcription. |
Prediction task | We do not have retry annotations for this larger set, but we have transcriptions for the first member of each query pair, enabling us to calculate the word error rate (WER) of each query’s recognition hypothesis, and thus obtain ground truth for half of our retry definition. |
Abstract | The experimental results demonstrate that our model is able to significantly outperform the state-of-the-art coherence model by Barzilay and Lapata (2005), reducing the error rate of the previous approach by an average of 29% over three data sets against human upper bounds. |
Experiments | For the combined model, the error rates are significantly reduced in all three data sets. |
Experiments | The average error rate reductions against 100% are 9.57% for the full model and 26.37% for the combined model. |
Experiments | If we compute the average error rate reductions against the human upper bounds (rather than an oracular 100%), the average error rate reduction for the full model is 29% and that for the combined model is 73%. |
Abstract | Evaluation results show a 44% reduction in the error rate relative to the best prior systems, averaging over all metrics, and up to 61% reduction in the error rate on grammaticality judgments. |
Conclusions | Our system achieved a 44% reduction in the error rate relative to both the Heilman and Smith, and the Lindberg et al. |
Results | As seen in Table 4, our results represent a 44% reduction in the error rate relative to Heilman and Smith on the average rating over all metrics, and as high as 61% reduction in the error rate on grammaticality judgments. |
Results | Interestingly, our system again achieved a 44% reduction in the error rate when averaging over all metrics, just as it did in the Heilman and Smith comparison. |
Abstract | The experimental results show that 1) linguistic features alone outperform word posterior probability based confidence estimation in error detection; and 2) linguistic features can further provide complementary information when combined with word confidence scores, which collectively reduce the classification error rate by 18.52% and improve the F measure by 16.37%.
Experiments | To determine the true class of a word in a generated translation hypothesis, we follow (Blatz et al., 2003) to use the word error rate (WER). |
Experiments | To evaluate the overall performance of error detection, we use the commonly used classification error rate (CER) metric to evaluate our classifiers.
SMT System | For minimum error rate tuning (Och, 2003), we use NIST MT-02 as the development set for the translation task. |
Experiments and results | Table 2: Percentage of positive samples, and averaged error rate for positive (P) and negative (N) samples for the first 20 iterations using the agreement-based and our confidence labeling methods. |
Experiments and results | Table 2 shows the percentage of the positive samples added for the first 20 iterations, and the average labeling error rate of those samples for the self-labeled positive and negative classes for two methods. |
Experiments and results | The agreement-based random selection added more negative samples that also have higher error rate than the positive samples. |
Conditional Random Fields | Results are reported in terms of phoneme error rates or tag error rates on the test set. |
Conditional Random Fields | Table 1: Features jointly testing label pairs and the observation are useful (error rates and feature counts).
Conditional Random Fields | Table 3: Error rates of the three regularizers on the Nettalk task. |
Abstract | We compare the performance of this model with that achieved using manual and automatic transcripts, and find that this new approach is roughly equivalent to having access to ASR transcripts with word error rates in the 33-37% range without actually having to do the ASR, plus it better handles utterances with out-of-vocabulary words.
Experimental results | Since ASR performance can vary greatly as we discussed above, we compare our system against automatic transcripts having word error rates of 12.6%, 20.9%, 29.2%, and 35.5% on the same speech source. |
Experimental results | [Figure residue: panels plotting ROUGE score against word error rate (0 to 0.5) for summary lengths Len=20% and Len=30%, with random baselines Rand=0.324, 0.340, 0.389, and 0.402.]
Experimental setup | These transcripts contain a word error rate of 12.6%, which is comparable to the best accuracies obtained in the literature on this data set. |
Abstract | Different well-defined approaches have been proposed, but the problem remains far from being solved: the best systems achieve an 11% Word Error Rate.
Abstract | Evaluated in French by 10-fold-cross validation, the system achieves a 9.3% Word Error Rate and a 0.83 BLEU score. |
Evaluation | The system was evaluated in terms of BLEU score (Papineni et al., 2001), Word Error Rate (WER) and Sentence Error Rate (SER). |
Comparison to BabySRL | Error rates: Initial .36; Trained .11; Initial (given 2 args) .66; Trained (given 2 args) .13; 2008 arg-arg position .65; 2008 arg-verb position 0; 2009 arg-arg position .82; 2009 arg-verb position .63
Comparison to BabySRL | The model presented in this paper does not share this restriction, so the raw error rate for this model is presented in the first two lines; the error rate once this additional restriction is imposed is given in the second two lines. |
Comparison to BabySRL | The 1-1 role bias error rate (before training) of the model presented in this paper is comparable to that of Connor et al. |
Introduction | Fraser and Marcu (2007) note that none of the tens of papers published over the last five years has shown that significant decreases in alignment error rate (AER) result in significant increases in translation performance. |
Introduction | After presenting the models and the algorithm in Sections 2 and 3, in Section 4 we examine how the new alignments differ from standard models, and find that the new method consistently improves word alignment performance, measured either as alignment error rate or weighted F—score. |
Word alignment results | (2008) show that alignment error rate (Och and Ney, 2003) can be improved with agreement constraints. |
Experiments | Although we did not examine the accuracy of real tasks in this paper, there is an interesting report that the word error rate of language models follows a power law with respect to perplexity (Klakow and Peters, 2002). |
Experiments | Thus, we conjecture that the word error rate also has a similar tendency as perplexity with respect to the reduced vocabulary size. |
Introduction | Each of these studies experimentally discusses tradeoff relationships between the size of the reduced corpus/model and its performance measured by perplexity, word error rate, and other factors.
Abstract | We present an adaptive translation quality estimation (QE) method to predict the human-targeted translation error rate (HTER) for a document-specific machine translation model. |
Introduction | In this paper we propose an adaptive quality estimation that predicts sentence-level human-targeted translation error rate (HTER) (Snover et al., 2006) for a document-specific MT post-editing system. |
Static MT Quality Estimation | score or translation error rate of the translated sentences or documents based on a set of features. |
Abstract | Minimum Error Rate Training (MERT) and Minimum Bayes-Risk (MBR) decoding are used in most current state-of-the-art Statistical Machine Translation (SMT) systems.
Introduction | Two popular techniques that incorporate the error criterion are Minimum Error Rate Training (MERT) (Och, 2003) and Minimum Bayes-Risk (MBR) decoding (Kumar and Byrne, 2004). |
Minimum Bayes-Risk Decoding | This reranking can be done for any sentence-level loss function such as BLEU (Papineni et al., 2001), Word Error Rate, or Position-independent Error Rate.
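MBR reranking of an n-best list under such a sentence-level loss can be sketched as follows; `mbr_rerank` and the toy position-mismatch loss are illustrative, not taken from the cited work:

```python
def mbr_rerank(hypotheses, scores, loss):
    """Minimum Bayes-Risk reranking of an n-best list.

    Picks the hypothesis h minimizing the expected loss
    sum_j p(h_j) * loss(h, h_j), approximating the posterior p by
    the normalized n-best scores. `loss` is any sentence-level loss,
    e.g. 1 - BLEU or a word error rate."""
    z = sum(scores)
    posterior = [s / z for s in scores]

    def risk(h):
        return sum(p * loss(h, hj) for p, hj in zip(posterior, hypotheses))

    return min(hypotheses, key=risk)

# Toy loss: fraction of word positions that disagree.
def mismatch(h, r):
    hw, rw = h.split(), r.split()
    return sum(a != b for a, b in zip(hw, rw)) / len(rw)

hyps = ["a b c", "a b d", "x y z"]
print(mbr_rerank(hyps, [0.4, 0.4, 0.2], mismatch))
```

Unlike MAP decoding, which simply keeps the top-scoring hypothesis, MBR can prefer a hypothesis that is close to many other probable hypotheses, which is why it tends to help when the n-best list contains clusters of near-duplicates.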
Amount of Information in Semantic Roles Inventory | Table 2: Percent Error rate reduction (ERR) across role labelling sets in three tasks in Zapirain et al. |
Amount of Information in Semantic Roles Inventory | (2008) and calculate the reduction in error rate based on this differential baseline for the two annotation schemes. |
Amount of Information in Semantic Roles Inventory | VerbNet has better role generalising ability overall as its reduction in error rate is greater than PropBank (first line of Table 2), but it is more degraded by lack of verb information (second and third lines of Table 2). |
Experimental Results | Table 2: Alignment error rate results for the bidirectional model versus the baseline directional models. |
Experimental Results | First, we measure alignment error rate (AER), which compares the pro- |
Experimental Results | The translation model weights were tuned for both the baseline and bidirectional alignments using lattice-based minimum error rate training (Kumar et al., 2009). |
Abstract | Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality. |
Conclusions | Our experiments show that sentiment annotation of utterance-level visual datastreams can be effectively performed, and that the use of multiple modalities can lead to error rate reductions of up to 10.5% as compared to the use of one modality at a time. |
Discussion | Compared to the best individual classifier, the relative error rate reduction obtained with the trimodal classifier is 10.5%. |
Experiment | We judged the quality of the decoding by measuring the percentage of characters in the cipher alphabet that were correctly guessed, and also the word error rate of the plaintext generated by our solution. |
Experiment | We did not count the accuracy or word error rate for unfinished ciphers. |
Results | We can also observe that even when there are errors (e.g., in the size 1000 cipher), the word error rate is very small. |
Multitask Quality Estimation 4.1 Experimental Setup | Shown above are the training mean baseline μ, single-task learning approaches, and multitask learning models, with the columns showing macro average error rates over all three response values.
Multitask Quality Estimation 4.1 Experimental Setup | Note that here error rates are measured over all of the three annotators’ judgements, and consequently are higher than those measured against their average response in Table 1. |
Multitask Quality Estimation 4.1 Experimental Setup | To test this, we trained single-task, pooled and multitask models on randomly sub-sampled training sets of different sizes, and plot their error rates in Figure 1. |
Abstract | We propose a discriminative ITG pruning framework using Minimum Error Rate Training and various features from previous work on ITG alignment. |
Evaluation | That is, the first evaluation metric, pruning error rate (henceforth PER), measures how many correct E-spans are discarded. |
The DPDI Framework | Parameter training of DPDI is based on Minimum Error Rate Training (MERT) (Och, 2003), a Widely used method in SMT. |
Discussion and Conclusions | First, we can see from the results that several systems appear better when evaluating on a correlation measure like Pearson’s ρ, while others appear better when analyzing error rate.
Discussion and Conclusions | Evaluating with a correlative measure yields predictably poor results, but evaluating the error rate indicates that it is comparable to (or better than) the more intelligent BOW metrics. |
Results | However, as the perceptron is designed to minimize error rate , this may not reflect an optimal objective when seeking to detect matches. |
Background | In this work, Minimum Error Rate Training (MERT) proposed by Och (2003) is used to estimate feature weights λ over a series of training samples.
Background | As the weighted BLEU is used to measure the translation accuracy on the training set, the error rate is defined to be: |
Background | The diversity is measured in terms of the Translation Error Rate (TER) metric proposed in (Snover et al., 2006). |