That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
Duan, Manjuan and White, Michael

Article Structure

Abstract

We investigate whether parsers can be used for self-monitoring in surface realization in order to avoid egregious errors involving “vicious” ambiguities, namely those where the intended interpretation fails to be considerably more likely than alternative ones.

Introduction

Rajkumar & White (2011; 2012) have recently shown that some rather egregious surface realization errors—in the sense that the reader would likely end up with the wrong interpretation—can be avoided by making use of features inspired by psycholinguistics research together with an otherwise state-of-the-art averaged perceptron realization ranking model (White and Rajkumar, 2009), as reviewed in the next section.

Background

We use the OpenCCG surface realizer for the experiments reported in this paper.

Simple Reranking

3.1 Methods

Reranking with SVMs

4.1 Methods

Since different parsers make different errors, we conjectured that dependencies in the intersection of the output of multiple parsers may be more reliable and thus may more reliably reflect human comprehension preferences.
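
The conjecture above rests on intersecting the dependency analyses produced by several parsers. As a rough illustration only (not the paper's code), the sketch below assumes dependencies are represented as (head index, label, dependent index) triples; the two parser outputs shown are hypothetical.

```python
# Minimal sketch: keep only the dependencies that every parser agrees on,
# on the assumption that agreed-upon dependencies are the more reliable ones.
from typing import Set, Tuple

Dependency = Tuple[int, str, int]  # (head_index, label, dependent_index)

def intersect_dependencies(parses) -> Set[Dependency]:
    """Return the dependencies shared by all of the parsers' best parses."""
    if not parses:
        return set()
    common = set(parses[0])
    for deps in parses[1:]:
        common &= set(deps)
    return common

# Hypothetical analyses of "I saw the man with the telescope":
berkeley = {(2, "nsubj", 1), (2, "dobj", 4), (2, "prep", 5)}  # PP attached to the verb
brown    = {(2, "nsubj", 1), (2, "dobj", 4), (4, "prep", 5)}  # PP attached to the noun
print(intersect_dependencies([berkeley, brown]))
# -> {(2, 'nsubj', 1), (2, 'dobj', 4)}: only the uncontested dependencies survive
```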

Analysis and Discussion

5.1 Targeted Manual Analysis

Related Work

Approaches to surface realization have been developed for LFG, HPSG, and TAG, in addition to CCG, and recently statistical dependency-based approaches have been developed as well; see the report from the first surface realization shared task.

Conclusion

In this paper, we have shown that while using parse accuracy in a simple reranking strategy for self-monitoring fails to improve BLEU scores over a state-of-the-art averaged perceptron realization ranking model, it is possible to significantly increase BLEU scores using an SVM ranker that combines the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes that human readers would be unlikely to make.

Topics

perceptron

Appears in 29 sentences as: perceptron (30)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
    Page 1, “Abstract”
  2. Rajkumar & White (2011; 2012) have recently shown that some rather egregious surface realization errors—in the sense that the reader would likely end up with the wrong interpretation—can be avoided by making use of features inspired by psycholinguistics research together with an otherwise state-of-the-art averaged perceptron realization ranking model (White and Rajkumar, 2009), as reviewed in the next section.
    Page 1, “Introduction”
  3. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  4. Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature.
    Page 2, “Introduction”
  5. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  6. Using the averaged perceptron algorithm (Collins, 2002), White & Rajkumar (2009) trained a structured prediction ranking model to combine these existing syntactic models with several n-gram language models.
    Page 3, “Background”
  7. The first one is the baseline generative model (hereafter, generative model) used in training the averaged perceptron model.
    Page 4, “Simple Reranking”
  8. The second one is the averaged perceptron model (hereafter, perceptron model), which uses all the features reviewed in Section 2.
    Page 4, “Simple Reranking”
  9. Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
    Page 5, “Simple Reranking”
  10. (c) perceptron best
    Page 5, “Simple Reranking”
  11. Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
    Page 5, “Simple Reranking”
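
As a rough illustration of the averaged perceptron ranking setup cited in sentence 6 above (Collins, 2002), here is a minimal sketch. It is not White & Rajkumar's implementation: the feature function and the oracle that picks the best candidate in each n-best list are hypothetical placeholders.

```python
import numpy as np

def train_averaged_perceptron(nbest_lists, feats, oracle, epochs=10):
    """Averaged perceptron ranking in the style of Collins (2002).

    nbest_lists -- a list of n-best realization lists
    feats       -- hypothetical callable mapping a realization to a feature vector
    oracle      -- hypothetical callable returning the index of the best
                   (e.g. highest-BLEU) realization in a list
    """
    dim = len(np.asarray(feats(nbest_lists[0][0])))
    w, w_sum, n_updates = np.zeros(dim), np.zeros(dim), 0
    for _ in range(epochs):
        for cands in nbest_lists:
            vectors = [np.asarray(feats(c)) for c in cands]
            predicted = int(np.argmax([v @ w for v in vectors]))
            gold = oracle(cands)
            if predicted != gold:
                # Standard perceptron update toward the oracle realization.
                w = w + vectors[gold] - vectors[predicted]
            w_sum += w
            n_updates += 1
    return w_sum / n_updates  # averaged weights smooth over late updates
```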

SVM

Appears in 23 sentences as: SVM (24)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
    Page 1, “Abstract”
  2. Moreover, via a targeted manual analysis, we demonstrate that the SVM reranker frequently manages to avoid vicious ambiguities, while its ranking errors tend to affect fluency much more often than adequacy.
    Page 1, “Abstract”
  3. Consequently, we examine two reranking strategies, one a simple baseline approach and the other using an SVM reranker (Joachims, 2002).
    Page 2, “Introduction”
  4. Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature.
    Page 2, “Introduction”
  5. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  6. Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
    Page 2, “Introduction”
  7. In Section 4, we describe how we trained an SVM reranker and report our results using BLEU scores (Papineni et al., 2002).
    Page 2, “Introduction”
  8. Similarly, we conjectured that large differences in the realizer’s perceptron model score may more reliably reflect human fluency preferences than small ones, and thus we combined this score with features for parser accuracy in an SVM ranker.
    Page 5, “Reranking with SVMs 4.1 Methods”
  9. Additionally, given that parsers may more reliably recover some kinds of dependencies than others, we included features for each dependency type, so that the SVM ranker might learn how to weight them appropriately.
    Page 5, “Reranking with SVMs 4.1 Methods”
  10. We trained the SVM ranker (Joachims, 2002) with a linear kernel and chose the hyper-parameter c, which tunes the tradeoff between training error and margin, with 6-fold cross-validation on the devset.
    Page 6, “Reranking with SVMs 4.1 Methods”
  11. Table 3 shows the results of different SVM ranking models on the devset.
    Page 6, “Reranking with SVMs 4.1 Methods”
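
The ranker described in sentence 10 above was trained with SVMrank (Joachims, 2002). The sketch below is only a rough stand-in using scikit-learn: it reduces ranking to classification over pairwise feature differences and tunes C with 6-fold cross-validation, ignoring the grouping of pairs by n-best list for brevity. The `train_groups` variable is assumed to be prepared elsewhere.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def pairwise_examples(groups):
    """Turn scored n-best groups into pairwise difference vectors.

    groups -- hypothetical list of n-best groups; each group is a list of
              (feature_vector, quality) pairs, quality being e.g. sentence BLEU.
    """
    X, y = [], []
    for cands in groups:
        for fi, qi in cands:
            for fj, qj in cands:
                if qi > qj:  # candidate i is preferred over candidate j
                    diff = np.asarray(fi) - np.asarray(fj)
                    X.extend([diff, -diff])
                    y.extend([1, -1])
    return np.vstack(X), np.array(y)

X, y = pairwise_examples(train_groups)  # train_groups: assumed given
search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=6)
search.fit(X, y)  # linear kernel; C plays the role of the error/margin tradeoff c
score = lambda fv: float(search.best_estimator_.decision_function([fv])[0])
```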

BLEU

Appears in 20 sentences as: BLEU (24)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
    Page 1, “Abstract”
  2. However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
    Page 1, “Abstract”
  3. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  4. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  5. Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
    Page 2, “Introduction”
  6. In Section 4, we describe how we trained an SVM reranker and report our results using BLEU scores (Papineni et al., 2002).
    Page 2, “Introduction”
  7. In Section 5, we present a targeted manual analysis of the development set sentences with the greatest change in BLEU scores, discussing both successes and errors.
    Page 2, “Introduction”
  8. Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
    Page 5, “Simple Reranking”
  9. Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
    Page 5, “Simple Reranking”
  10. In sum, although simple ranking helps to avoid vicious ambiguity in some cases, the overall results of simple ranking are no better than the perceptron model (according to BLEU, at least), as parse failures that are not reflective of human interpretive tendencies too often lead the ranker to choose dispreferred realizations.
    Page 5, “Simple Reranking”
  11. In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
    Page 6, “Reranking with SVMs 4.1 Methods”
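
Since the sentences above rely on comparing each candidate realization with its reference via BLEU (Papineni et al., 2002), here is a minimal sketch of how candidates might be ordered for building preference pairs. It uses NLTK's smoothed sentence-level BLEU as a convenient stand-in for whatever implementation the authors used.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences

def rank_by_bleu(reference, candidates):
    """Order candidate realizations by sentence-level BLEU against the reference.

    The resulting order can be used to build preference pairs for ranker training.
    """
    ref_tokens = [reference.split()]
    scored = [(cand, sentence_bleu(ref_tokens, cand.split(), smoothing_function=smooth))
              for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical usage:
print(rank_by_bleu("the cat sat on the mat",
                   ["the cat sat on the mat", "on the mat the cat sat"]))
```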

BLEU scores

Appears in 16 sentences as: BLEU score (7) BLEU scores (12)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
    Page 1, “Abstract”
  2. However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
    Page 1, “Abstract”
  3. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  4. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  5. Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
    Page 2, “Introduction”
  6. In Section 4, we describe how we trained an SVM reranker and report our results using BLEU scores (Papineni et al., 2002).
    Page 2, “Introduction”
  7. In Section 5, we present a targeted manual analysis of the development set sentences with the greatest change in BLEU scores, discussing both successes and errors.
    Page 2, “Introduction”
  8. Table 2: Devset BLEU scores for simple ranking on top of n-best perceptron model realizations
    Page 5, “Simple Reranking”
  9. Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
    Page 5, “Simple Reranking”
  10. In training, we used the BLEU scores of each realization compared with its reference sentence to establish a preference order over pairs of candidate realizations, assuming that the original corpus sentences are generally better than related alternatives, and that BLEU can somewhat reliably predict human preference judgments.
    Page 6, “Reranking with SVMs 4.1 Methods”
  11. The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02).
    Page 7, “Reranking with SVMs 4.1 Methods”

reranking

Appears in 16 sentences as: rerank (2) reranked (2) reranker (7) reranking (9)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
    Page 1, “Abstract”
  2. Moreover, via a targeted manual analysis, we demonstrate that the SVM reranker frequently manages to avoid vicious ambiguities, while its ranking errors tend to affect fluency much more often than adequacy.
    Page 1, “Abstract”
  3. To do so—in a nutshell—we enumerate an n-best list of realizations and rerank them if necessary to avoid vicious ambiguities, as determined by one or more automatic parsers.
    Page 2, “Introduction”
  4. Consequently, we examine two reranking strategies, one a simple baseline approach and the other using an SVM reranker (Joachims, 2002).
    Page 2, “Introduction”
  5. Our simple reranking strategy for self-monitoring is to rerank the realizer’s n-best list by parse accuracy, preserving the original order in case of ties.
    Page 2, “Introduction”
  6. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  7. In inspecting the results of reranking with this strategy, we observe that while it does sometimes succeed in avoiding egregious errors involving vicious ambiguities, common parsing mistakes such as PP-attachment errors lead to unnecessarily sacrificing conciseness or fluency in order to avoid ambiguities that would be easily tolerated by human readers.
    Page 2, “Introduction”
  8. Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature.
    Page 2, “Introduction”
  9. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  10. Additionally, in a targeted manual analysis, we find that in cases where the SVM reranker improves the BLEU score, improvements to fluency and adequacy are roughly balanced, while in cases where the BLEU score goes down, it is mostly fluency that is made worse (with reranking yielding an acceptable paraphrase roughly one third of the time in both cases).
    Page 2, “Introduction”
  11. In Section 3, we report on our experiments with the simple reranking strategy, including a discussion of the ways in which this method typically fails.
    Page 2, “Introduction”
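
The simple strategy quoted in sentence 5 above (rerank the n-best list by parse accuracy, keeping the original order for ties) can be written in a few lines. This is a sketch, not the OpenCCG code; the parse_accuracy function is a hypothetical placeholder, e.g. dependency F1 of a candidate's parse against the input dependencies.

```python
def simple_rerank(nbest, parse_accuracy):
    """Rerank an n-best realization list by parse accuracy.

    `nbest` is assumed to be in the realizer's model-score order. Python's
    sort is stable, so candidates with equal parse accuracy keep that
    original order, matching the tie-breaking described above.
    """
    return sorted(nbest, key=parse_accuracy, reverse=True)

# Hypothetical usage: pick the top realization after reranking.
# best = simple_rerank(realizations, parse_accuracy=lambda r: dep_f1(parse(r), gold_deps))[0]
```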

precision and recall

Appears in 7 sentences as: precision and recall (11)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature.
    Page 2, “Introduction”
  2. precision and recall: labeled and unlabeled precision and recall for each parser’s best parse
    Page 6, “Reranking with SVMs 4.1 Methods”
  3. per-label precision and recall (dep): precision and recall for each type of dependency obtained from each parser’s best parse (using zero if not defined for lack of predicted or gold dependencies with a given label)
    Page 6, “Reranking with SVMs 4.1 Methods”
  4. n-best precision and recall (nbest): labeled and unlabeled precision and recall for each parser’s top five parses, along with the same features for the most accurate of these parses
    Page 6, “Reranking with SVMs 4.1 Methods”
  5. For each parser, we trained a model with its overall precision and recall features, as shown at the top of Table 3.
    Page 6, “Reranking with SVMs 4.1 Methods”
  6. Next, to this combined model we separately added (i) the per-label precision and recall features from all the parsers (BBS+dep), and (ii) the n-best features from the parsers (BBS+nbest).
    Page 6, “Reranking with SVMs 4.1 Methods”
  7. Somewhat surprisingly, the Berkeley parser did as well as all three parsers using just the overall precision and recall features, but not quite as well using all features.
    Page 7, “Reranking with SVMs 4.1 Methods”
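
The feature definitions quoted in sentences 2 through 4 above amount to standard set-based precision and recall over dependencies. The sketch below assumes dependencies are (head, label, dependent) triples; it is illustrative, not the paper's code.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted dependencies against gold dependencies.

    Both arguments are sets of (head, label, dependent) triples; returns 0.0
    where the measure is undefined (no predicted or no gold dependencies),
    mirroring the zero convention described above.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def unlabeled(deps):
    """Drop labels to compute unlabeled precision and recall."""
    return {(head, dep) for (head, _, dep) in deps}

def per_label_precision_recall(predicted, gold):
    """Per-label precision and recall, one (p, r) pair per dependency label."""
    labels = {label for (_, label, _) in predicted | gold}
    return {label: precision_recall({d for d in predicted if d[1] == label},
                                    {d for d in gold if d[1] == label})
            for label in labels}
```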

model score

Appears in 6 sentences as: model score (6)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. However, by using an SVM ranker to combine the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes, we show that significant increases in BLEU scores can be achieved.
    Page 1, “Abstract”
  2. Therefore, to develop a more nuanced self-monitoring reranker that is more robust to such parsing mistakes, we trained an SVM using dependency precision and recall features for all three parses, their n-best parsing results, and per-label precision and recall for each type of dependency, together with the realizer’s normalized perceptron model score as a feature.
    Page 2, “Introduction”
  3. Similarly, we conjectured that large differences in the realizer’s perceptron model score may more reliably reflect human fluency preferences than small ones, and thus we combined this score with features for parser accuracy in an SVM ranker.
    Page 5, “Reranking with SVMs 4.1 Methods”
  4. perceptron model score: the score from the realizer’s model, normalized to [0,1] for the realizations in the n-best list
    Page 6, “Reranking with SVMs 4.1 Methods”
  5. We trained different models to investigate the contribution made by different parsers and different types of features, with the perceptron model score included as a feature in all models.
    Page 6, “Reranking with SVMs 4.1 Methods”
  6. In this paper, we have shown that while using parse accuracy in a simple reranking strategy for self-monitoring fails to improve BLEU scores over a state-of-the-art averaged perceptron realization ranking model, it is possible to significantly increase BLEU scores using an SVM ranker that combines the realizer’s model score together with features from multiple parsers, including ones designed to make the ranker more robust to parsing mistakes that human readers would be unlikely to make.
    Page 9, “Conclusion”
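
Sentence 4 above says only that the perceptron score is normalized to [0,1] within each n-best list; min-max scaling, sketched below, is one natural reading of that and is offered as an assumption rather than the authors' exact recipe.

```python
def normalize_scores(scores):
    """Min-max normalize one n-best list's model scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate list where every candidate gets the same score
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Example: normalize_scores([-3.2, -1.5, -0.4]) -> [0.0, ~0.607, 1.0]
```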

Berkeley parser

Appears in 5 sentences as: Berkeley parser (6)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. We chose the Berkeley parser (Petrov et al., 2006), Brown parser (Charniak and Johnson, 2005) and Stanford parser (Klein and Manning, 2003) to parse the realizations generated by the
    Page 4, “Simple Reranking”
  2. Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
    Page 5, “Simple Reranking”
  3. Finally, since the Berkeley parser yielded the best results on its own, we also tested models using all the feature classes but only using this parser by itself.
    Page 6, “Reranking with SVMs 4.1 Methods”
  4. Somewhat surprisingly, the Berkeley parser did as well as all three parsers using just the overall precision and recall features, but not quite as well using all features.
    Page 7, “Reranking with SVMs 4.1 Methods”
  5. Given that with the more refined SVM ranker, the Berkeley parser worked nearly as well as all three parsers together using the complete feature set, the prospects for future work on a more realistic scenario using the OpenCCG parser in an SVM ranker for self-monitoring now appear much more promising, either using OpenCCG’s reimplementation of Hockenmaier & Steedman’s generative CCG model, or using the Berkeley parser trained on OpenCCG’s enhanced version of the CCGbank, along the lines of Fowler and Penn (2010).
    Page 8, “Analysis and Discussion”

significant improvement

Appears in 5 sentences as: significant improvement (2) significant improvements (2) significantly improving (1)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. With the SVM reranker, we obtain a significant improvement in BLEU scores over White & Rajkumar’s averaged perceptron model on both development and test data.
    Page 2, “Introduction”
  2. To improve word ordering decisions, White & Rajkumar (2012) demonstrated that incorporating a feature into the ranker inspired by Gibson’s (2000) dependency locality theory can deliver statistically significant improvements in automatic evaluation scores, better match the distributional characteristics of sentence orderings, and significantly reduce the number of serious ordering errors (some involving vicious ambiguities) as confirmed by a targeted human evaluation.
    Page 3, “Background”
  3. However, as shown in Table 2, none of the parsers yielded significant improvements on the top of the perceptron model.
    Page 5, “Simple Reranking”
  4. Both the per-label precision and recall features and the n-best parse features contributed to achieving a significant improvement compared to the perceptron model.
    Page 7, “Reranking with SVMs 4.1 Methods”
  5. The complete model, BBS+dep+nbest, achieved a BLEU score of 88.73, significantly improving upon the perceptron model (p < 0.02).
    Page 7, “Reranking with SVMs 4.1 Methods”

CCG

Appears in 4 sentences as: CCG (4)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. In the figure, nodes correspond to discourse referents labeled with lexical predicates, and dependency relations between nodes encode argument structure (gold standard CCG lexical categories are also shown); note that semantically empty function words such as infinitival-to are missing.
    Page 2, “Background”
  2. The model takes as its starting point two probabilistic models of syntax that have been developed for CCG parsing, Hockenmaier & Steedman’s (2002) generative model and Clark & Curran’s (2007) normal-form model.
    Page 3, “Background”
  3. Given that with the more refined SVM ranker, the Berkeley parser worked nearly as well as all three parsers together using the complete feature set, the prospects for future work on a more realistic scenario using the OpenCCG parser in an SVM ranker for self-monitoring now appear much more promising, either using OpenCCG’s reimplementation of Hockenmaier & Steedman’s generative CCG model, or using the Berkeley parser trained on OpenCCG’s enhanced version of the CCGbank, along the lines of Fowler and Penn (2010).
    Page 8, “Analysis and Discussion”
  4. Approaches to surface realization have been developed for LFG, HPSG, and TAG, in addition to CCG, and recently statistical dependency-based approaches have been developed as well; see the report from the first surface realization shared task.
    Page 8, “Related Work”

generative model

Appears in 4 sentences as: generative model (4) generative model’s (1)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  2. The model takes as its starting point two probabilistic models of syntax that have been developed for CCG parsing, Hockenmaier & Steedman’s (2002) generative model and Clark & Curran’s (2007) normal-form model.
    Page 3, “Background”
  3. The first one is the baseline generative model (hereafter, generative model) used in training the averaged perceptron model.
    Page 4, “Simple Reranking”
  4. Simple ranking with the Berkeley parser of the generative model’s n-best realizations raised the BLEU score from 85.55 to 86.07, well below the averaged perceptron model’s BLEU score of 87.93.
    Page 5, “Simple Reranking”

Treebank

Appears in 4 sentences as: Treebank (5)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. Using parse accuracy in a simple reranking strategy for self-monitoring, we find that with a state-of-the-art averaged perceptron realization ranking model, BLEU scores cannot be improved with any of the well-known Treebank parsers we tested, since these parsers too often make errors that human readers would be unlikely to make.
    Page 1, “Abstract”
  2. With this simple reranking strategy and each of three different Treebank parsers, we find that it is possible to improve BLEU scores on Penn Treebank development data with White & Rajkumar’s (2011; 2012) baseline generative model, but not with their averaged perceptron model.
    Page 2, “Introduction”
  3. We ran two OpenCCG surface realization models on the CCGbank dev set (derived from Section 00 of the Penn Treebank) and obtained n-best (n = 10) realizations.
    Page 4, “Simple Reranking”
  4. A limitation of the experiments reported in this paper is that OpenCCG’s input semantic dependency graphs are not the same as the Stanford dependencies used with the Treebank parsers, and thus we have had to rely on the gold parses in the PTB to derive gold dependencies for measuring accuracy of parser dependency recovery.
    Page 8, “Analysis and Discussion”

word order

Appears in 3 sentences as: word order (2) word ordering (1)
In That's Not What I Meant! Using Parsers to Avoid Structural Ambiguities in Generated Text
  1. This model improved upon the state-of-the-art in terms of automatic evaluation scores on held-out test data, but nevertheless an error analysis revealed a surprising number of word order, function word and inflection errors.
    Page 3, “Background”
  2. To improve word ordering decisions, White & Rajkumar (2012) demonstrated that incorporating a feature into the ranker inspired by Gibson’s (2000) dependency locality theory can deliver statistically significant improvements in automatic evaluation scores, better match the distributional characteristics of sentence orderings, and significantly reduce the number of serious ordering errors (some involving vicious ambiguities) as confirmed by a targeted human evaluation.
    Page 3, “Background”
  3. Using dependencies allowed us to measure parse accuracy independently of word order.
    Page 4, “Simple Reranking”
