Predicting Grammaticality on an Ordinal Scale
Heilman, Michael and Cahill, Aoife and Madnani, Nitin and Lopez, Melissa and Mulholland, Matthew and Tetreault, Joel

Article Structure

Abstract

Automated methods for identifying whether sentences are grammatical have various potential applications (e.g., machine translation, automated essay scoring, computer-assisted language learning).

Introduction

In this paper, we develop a system for the task of predicting the grammaticality of sentences, and present a dataset of learner sentences rated for grammaticality.

Dataset Description

We created a dataset consisting of 3,129 sentences randomly selected from essays written by nonnative speakers of English as part of a test of English language proficiency.

System Description

This section describes the statistical model (§3.1) and features (§3.2) used by our system.

Experiments

Next, we present evaluations on the GUG dataset.

Discussion and Conclusions

In this paper, we developed a system for predicting grammaticality on an ordinal scale and created a labeled dataset that we have released publicly (§2) to enable more realistic evaluations in future research.

Topics

development set

Appears in 6 sentences as: development set (5) development sets (1)
In Predicting Grammaticality on an Ordinal Scale
  1. On 10 preliminary runs with the development set, this variance
    Page 3, “System Description”
  2. Table 1: Pearson’s r on the development set, for our full system and variations excluding each feature type.
    Page 4, “System Description”
  3. For this experiment, all models were estimated from the training set and evaluated on the development set.
    Page 4, “Experiments”
  4. For test set evaluations, we trained on the combination of the training and development sets (§2), to maximize the amount of training data for the final experiments.
    Page 4, “Experiments”
  5. We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set.
    Page 4, “Experiments”
  6. For evaluating the three single-feature baselines discussed below, we used the same approach, except with a grid ranging from the minimum development set feature value to the maximum plus 0.1% of the range.
    Page 4, “Experiments”
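The threshold search described in snippets 5 and 6 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and toy inputs are ours, and we assume predictions and averaged gold judgments are plain lists of floats.

```python
def pick_binarization_threshold(dev_preds, dev_gold, n_points=1001, lo=1.0, hi=4.0):
    """Grid-search a cutoff on [lo, hi] that maximizes binarized accuracy."""
    # Binarize gold labels: an average judgment of at least 3.5 rounds to 4 (label 1).
    gold_binary = [1 if g >= 3.5 else 0 for g in dev_gold]
    best_t, best_acc = lo, -1.0
    for k in range(n_points):  # 1001 evenly spaced points from 1 to 4
        t = lo + (hi - lo) * k / (n_points - 1)
        preds_binary = [1 if p >= t else 0 for p in dev_preds]
        acc = sum(p == g for p, g in zip(preds_binary, gold_binary)) / len(gold_binary)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

For the single-feature baselines (snippet 6), the same loop would presumably run with `lo` and `hi` set to the feature's development-set minimum and its maximum plus 0.1% of the range.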

See all papers in Proc. ACL 2014 that mention development set.

See all papers in Proc. ACL that mention development set.

language model

Appears in 6 sentences as: Language Model (1) language model (6)
In Predicting Grammaticality on an Ordinal Scale
  1. In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
    Page 1, “Abstract”
  2. 3.2.2 n-gram Count and Language Model Features
    Page 3, “System Description”
  3. The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002):
    Page 3, “System Description”
  4. Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
    Page 3, “System Description”
  5. To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model were unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (num_misspelled).
    Page 5, “Experiments”
  6. While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
    Page 5, “Discussion and Conclusions”
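The two LM feature types mentioned above, average log-probability and out-of-vocabulary count, can be illustrated with a toy add-alpha unigram model. The real system uses 5-gram SRILM models; all names below are our own.

```python
import math
from collections import Counter

def train_unigram_lm(corpus_tokens, alpha=1.0):
    """Tiny add-alpha unigram LM standing in for the paper's 5-gram models."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = set(counts)
    denom = total + alpha * (len(vocab) + 1)  # +1 reserves mass for unknown tokens
    def logprob(token):
        return math.log((counts.get(token, 0) + alpha) / denom)
    return logprob, vocab

def lm_features(tokens, logprob, vocab):
    """Average token log-probability and OOV count for one sentence."""
    oov = sum(1 for t in tokens if t not in vocab)
    avg = sum(logprob(t) for t in tokens) / len(tokens) if tokens else 0.0
    return avg, oov
```

A sentence full of in-vocabulary tokens gets a higher average log-probability and a lower OOV count than one with unseen words, which is what makes these useful grammaticality proxies.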

binarized

Appears in 5 sentences as: binarization (1) binarized (5) binarizing (1)
In Predicting Grammaticality on an Ordinal Scale
  1. From an initial prediction ŷ, it produces the final prediction ŷ′ = (ŷ − M_pred) · (SD_gold / SD_pred) + M_gold. This rescaling does not affect Pearson’s r correlations or rankings, but it would affect binarized predictions.
    Page 3, “System Description”
  2. We also trained and evaluated on binarized versions of the ordinal GUG labels: a sentence was labeled 1 if the average judgment was at least 3.5 (i.e., would round to 4), and 0 otherwise.
    Page 4, “Experiments”
  3. To train our system on binarized data, we replaced the ℓ2-regularized linear regression model with an ℓ2-regularized logistic regression and used Kendall’s τ rank correlation between the predicted probabilities of the positive class and the binary gold standard labels as the grid search metric (§3.1) instead of Pearson’s r.
    Page 4, “Experiments”
  4. We also evaluated the binary system for the ordinal task by computing correlations between its estimated probabilities and the averaged human scores, and we evaluated the ordinal system for the binary task by binarizing its predictions.
    Page 4, “Experiments”
  5. We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set.
    Page 4, “Experiments”
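Snippet 3 uses Kendall's τ between predicted probabilities and binary gold labels as a model-selection metric. A minimal τ-a implementation is sketched below; the paper does not say which τ variant was used, and this O(n²) version is illustrative only.

```python
def kendalls_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1   # pair ordered the same way in x and y
            elif s < 0:
                discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

With many ties in the binary labels, a tie-corrected variant (τ-b) would give different values; in practice one would likely call a library routine such as `scipy.stats.kendalltau` instead.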

machine translation

Appears in 4 sentences as: machine translation (4)
In Predicting Grammaticality on an Ordinal Scale
  1. Automated methods for identifying whether sentences are grammatical have various potential applications (e.g., machine translation, automated essay scoring, computer-assisted language learning).
    Page 1, “Abstract”
  2. Such a system could be used, for example, to check or to rank outputs from systems for text summarization, natural language generation, or machine translation.
    Page 1, “Introduction”
  3. While some applications (e.g., grammar checking) rely on such fine-grained predictions, others might be better addressed by sentence-level grammaticality judgments (e.g., machine translation evaluation).
    Page 1, “Introduction”
  4. …the grammaticality of machine translation outputs (Gamon et al., 2005; Parton et al., 2011), such as the MT Quality Estimation Shared Tasks (Bojar et al., 2013, §6), but relatively little on evaluating the grammaticality of naturally occurring text.
    Page 1, “Introduction”

n-gram

Appears in 4 sentences as: n-gram (4)
In Predicting Grammaticality on an Ordinal Scale
  1. In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
    Page 1, “Abstract”
  2. 3.2.2 n-gram Count and Language Model Features
    Page 3, “System Description”
  3. n-gram frequencies from Gigaword and whether the link parser can fully parse the sentence.
    Page 4, “Experiments”
  4. While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
    Page 5, “Discussion and Conclusions”

sentence-level

Appears in 4 sentences as: sentence-level (4)
In Predicting Grammaticality on an Ordinal Scale
  1. While some applications (e.g., grammar checking) rely on such fine-grained predictions, others might be better addressed by sentence-level grammaticality judgments (e.g., machine translation evaluation).
    Page 1, “Introduction”
  2. Regarding sentence-level grammaticality, there has been much work on rating the grammaticality of machine translation outputs…
    Page 1, “Introduction”
  3. With this unique data set, which we will release to the research community, it is now possible to conduct realistic evaluations for predicting sentence-level grammaticality.
    Page 1, “Introduction”
  4. This is the most realistic evaluation of methods for predicting sentence-level grammaticality to date.
    Page 5, “Discussion and Conclusions”

gold standard

Appears in 3 sentences as: gold standard (3)
In Predicting Grammaticality on an Ordinal Scale
  1. For all experiments reported later, we used this average of six judgments as our gold standard.
    Page 2, “Dataset Description”
  2. It saves the mean and standard deviation of the training data gold standard (M_gold and SD_gold, respectively) and of its own predictions on the training data (M_pred and SD_pred, respectively).
    Page 3, “System Description”
  3. To train our system on binarized data, we replaced the ℓ2-regularized linear regression model with an ℓ2-regularized logistic regression and used Kendall’s τ rank correlation between the predicted probabilities of the positive class and the binary gold standard labels as the grid search metric (§3.1) instead of Pearson’s r.
    Page 4, “Experiments”
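The statistics saved in snippet 2 suggest a standard mean/SD rescaling of the model's outputs. A sketch of that step follows; this is our reading of §3.1, with illustrative names, using the population standard deviation.

```python
from statistics import mean, pstdev

def rescale_predictions(preds, gold_train, pred_train):
    """Shift and scale predictions so they match the training gold mean and SD."""
    m_gold, sd_gold = mean(gold_train), pstdev(gold_train)
    m_pred, sd_pred = mean(pred_train), pstdev(pred_train)
    return [(p - m_pred) / sd_pred * sd_gold + m_gold for p in preds]
```

Because this map is linear and increasing, it leaves Pearson's r and rankings unchanged; it matters only when the rescaled predictions are thresholded into binary labels.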

LM

Appears in 3 sentences as: LM (2) LM” (1)
In Predicting Grammaticality on an Ordinal Scale
  1. Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
    Page 3, “System Description”
  2. our system 0.668; — nonnative LM (§3.2.2) 0.665; — HPSG parse (§3.2.3) 0.664; — PCFG parse (§3.2.4) 0.662
    Page 4, “System Description”
  3. — spelling (§3.2.1) 0.643; — gigaword LM (§3.2.2) 0.638; — link parse (§3.2.3) 0.632
    Page 4, “System Description”

logistic regression

Appears in 3 sentences as: logistic regression (3)
In Predicting Grammaticality on an Ordinal Scale
  1. To train our system on binarized data, we replaced the ℓ2-regularized linear regression model with an ℓ2-regularized logistic regression and used Kendall’s τ rank correlation between the predicted probabilities of the positive class and the binary gold standard labels as the grid search metric (§3.1) instead of Pearson’s r.
    Page 4, “Experiments”
  2. Therefore, we used the same learning algorithms as for our system (i.e., ridge regression for the ordinal task and logistic regression for the binary task).
    Page 5, “Experiments”
  3. In preliminary experiments, we observed little difference in performance between logistic regression and the original support vector classifier used by the system from Post (2011).
    Page 5, “Experiments”

model scores

Appears in 3 sentences as: model score (1) model scores (2)
In Predicting Grammaticality on an Ordinal Scale
  1. In this work, we construct a statistical model of grammaticality using various linguistic features (e.g., misspelling counts, parser outputs, n-gram language model scores).
    Page 1, “Abstract”
  2. To create further baselines for comparison, we selected the following features that represent ways one might approximate grammaticality if a comprehensive model were unavailable: whether the link parser can fully parse the sentence (complete_link), the Gigaword language model score (gigaword_avglogprob), and the number of misspelled tokens (num_misspelled).
    Page 5, “Experiments”
  3. While Post found that such a system can effectively distinguish grammatical news text sentences from sentences generated by a language model, measuring the grammaticality of real sentences from language learners seems to require a wider variety of features, including n-gram counts, language model scores, etc.
    Page 5, “Discussion and Conclusions”

model trained

Appears in 3 sentences as: model trained (3)
In Predicting Grammaticality on an Ordinal Scale
  1. The model computes the following features from a 5-gram language model trained on the same three sections of English Gigaword using the SRILM toolkit (Stolcke, 2002):
    Page 3, “System Description”
  2. Finally, the system computes the average log-probability and number of out-of-vocabulary words from a language model trained on a collection of essays written by nonnative English speakers (“nonnative LM”).
    Page 3, “System Description”
  3. We selected a threshold for binarization from a grid of 1001 points from 1 to 4 that maximized the accuracy of binarized predictions from a model trained on the training set and evaluated on the binarized development set.
    Page 4, “Experiments”
